Using hashes of (mostly) anonymous subroutines, it is surprisingly easy to write XML translators with the Perl DOM interface.
If you iterate recursively over a DOM tree (such as one provided by the XML::DOM Perl module), you need something like a computed goto
to dispatch on node types, and, probably more important, different elements. For example, consider the structure-oriented DTD from Notes on XML DTD Design (including the IMAGE
element), and an analyzer for the VBOX
element. Written in purely imperative style, it could look like this:
my $node = shift; # We assume that $node itself has already been visited. foreach my $child ($node->getChildNodes) { if ($child->getNodeType == TEXT_NODE) { # We will probably want to ignore this if it is # whitespace (an signal an error otherwise). } elsif ($child->getNodeType == ELEMENT_NODE) { if ($child->getNodeName eq 'HBOX') { process_HBOX ($child); } elsif ($child->getNodeName eq 'LIST') { process_LIST ($child); } elsif ($child->getNodeName eq 'IMAGE') { process_IMAGE ($child); } else { # Signal an error. } } else { # Process comments etc. } }
Clearly, this is unacceptable. The syntactic overhead is huge. What we really want is some case statement à la Ada (but also for strings).
for Child in getChildNodes (Node) loop case getNodeType (Child) is when TEXT_NODE => -- Process as above. when ELEMENT_NODE => case getNodeName (Child) is when "HBOX" => Process_HBOX (Child); when "LIST" => Process_LIST (Child); when "IMAGE" => Process_IMAGE (Child); when others => -- Signal an error. end case; when others => -- As above. end case; end loop;
This Ada-inspired syntax immediately suggests to use Perl hashes.
First of all, we note that a TEXT_NODE
does not require radically different processing at this level, so we can work with a single case
statement. We choose to represent text nodes with an element name ' TEXT '
(with a leading and trailing space), so our dispatch table could look like this:
' TEXT ' => process_Text_Node 'HBOX' => process_HBOX 'LIST' => process_LIST 'IMAGE' => process_IMAGE
With a small convenience routine called dispatchingName
, we get the following Perl snippet.
my %dispatch = ( ' TEXT ' => \&process_Text_Node, HBOX => \&process_HBOX, LIST => \&process_LIST, IMAGE => \&process_IMAGE, ); my $node = shift; foreach my $child ($node->getChildNodes) { &{$dispatch{dispatchingName ($child)}} ($child); }
The last thre lines are generic and can be put into a convenience routine, and we arrive at the code below.
my $node = shift; dispatchChildNodes ($node, { ' TEXT ' => \&process_Text_Node, HBOX => \&process_HBOX, LIST => \&process_LIST, IMAGE => \&process_IMAGE, });
This is quite hard to beat, but one detail is still missing.
If you want to write a fairly generic kind of XML translator with node-by-node processing, you need three pieces of information at each node:
the source node,
the target node (where processed information should go),
state.
This means that we have to enhance our code a bit.
my ($source, $target, $state) = @_; dispatchChildNodes ($source, $target, $state, { ' TEXT ' => \&process_Text_Node, HBOX => \&process_HBOX, LIST => \&process_LIST, IMAGE => \&process_IMAGE, });
Consequentially, the implementation of process_HBOX
should start with the following code:
sub process_HBOX { my ($source, $target, $state) = @_; # probably some invocation of dispatchChildNodes }
Each processing routine should examine $source
, process it (consulting and updating $state
as required), and append the result to $target
.
A neat aspect of the hash dispatch is that it plays so well with anonymous subroutines. Suppose that in our example, processing for the IMAGE
element is so simple that we do not want to defer it to a subroutine. We can use an anonymous subroutine instead.
my ($source, $target, $state) = @_; dispatchChildNodes ($source, $target, $state, { ' TEXT ' => \&process_Text_Node, HBOX => \&process_HBOX, LIST => \&process_LIST, IMAGE => sub { my ($source, $target, $state) = @_; # Process as usual. }, });
The my
statement is required becase Perl does not properly support closures. For the same reason, you have to use the $state
to pass information from the surrounding code to the nested subprogram.
Why do we not use Perl's introspection capability, mangle the element name in some way to get a subroutine name, and call that subroutine directly? First of all, all subroutines would have to be named. Second, if we operate in the "Best of Both Worlds" scenario of Notes on XML DTD Design, some tag names could be overloaded, and we have to dispatch to different subroutines, depending on the enclosing elements. This is trivially handled by our hashes approach.
In general, we think that the name-based subroutine dispatch is not a good idea. It might offer some reduction in terms of code size, but even this is not clear (because anonymous subroutines are not available, for example).
Have you ever looked at the source code of these pages and wondered which content management system would generate such clean XHTML, and why anyone would use it? Here is the answer.
This document describes some of the design decisions underlying the XML processing tools for this web site, and their influence on DTD design.
XML Processing with DOM and Perl
Using hashes of (mostly) anonymous subroutines, it is surprisingly easy to write XML translators with the Perl DOM interface.
2003-07-31: published