Push me, pull me… or just pull me out.
— Pearl Jam, Push Me, Pull Me
Currently, there are three popular ways of parsing XML: The DOM, the
SAX and Pull APIs.
DOM is probably the most well known, and it did a great job showing
off XML’s slowness. This is because DOM reads the full XML into the
core as a tree structure. The nodes of the tree mainly consist of
Element nodes and Text nodes. DOM is very easy to use, but also
very inefficient. Ruby’s REXML is mainly used in the DOM mode;
although it’s API is rubyfied and made a lot nicer, the core idea is
the same: provide a tree structure to XML documents.
The downsides of DOM (and the complexity of implementation) turned out
pretty quickly, and soon the Simple API for XML (SAX) was
implemented in Java. SAX is a streaming API, the SAX parser scans
over the document and emits events. For example, the simple document
<doc>This is a <em>very</em> simple XML document.</doc>
emits these events (simplified):
start_element "doc"
text "This is a "
start_element "em"
text "very"
end_element "em"
text " simple XML document."
end_element "doc"
Usually, the program defines a new class whose methods get called by
the SAX parser.
SAX is very efficient (the API is quite like the Expat one, which
is well-known being the fastest XML parser on earth), but can be
amazingly complex to use. (Personally, I don’t have a hard time
with it—at least for easy stuff—, but there are people that don’t
like state machines.)
REXML also provides a SAX interface, which is quite fast, but of
course not like the real thing. If you want absolute performance, I
still can only recommend the Ruby Expat bindings. Though the
binding was not updated recently (my version is from 2002), it
runs like hell.
The last, and the IMO most interesting way to parse XML is a pull API.
As usual, the first implementation was written in Java (XPP1, I think)
and the API was only informally (i.e. by implementation) defined. The
XmlPull API changed that. For Java, there now exist a bunch of
XML pull libraries that can all be accessed using the same way.
REXML provides a Pull API, too; but has been marked experimental and
they don’t recommend production use. Besides, it has problems with
namepaces (which is true for REXML in general. :-/).
Although being technically one step lower than SAX, Pull APIs are much
more intuitive to program. The Pull API requires you to fetch each
event by yourself, therefore you don’t need a state machine as you can
simply get the next events when you need them, not when they arrive.
This can be shown best at a simple example:
My old homepage provided an index facility, where words marked in
idx
and hidx
(for hidden index) are aggregated into a special
page. I implemented the program first using REXML and its DOM API.
The page to link to can be found out getting the name
attribute of
the page
root node. I basically needed an index word to page hash.
Here is the parsing code:
doc = Document.new(File.open(f))
htmlfile = XPath.first(doc, "/page/@name").to_s
["//idx", "//hidx"].each { |xp|
XPath.each(doc, xp) { |child|
word = child.text.squeeze(" ").strip
index[word] ||= []
index[word] << htmlfile
}
}
This program was a total performance hog, and intolerable to run (for
me, at least). I rewrote it using expat
:
class IdxParser < XML::Parser
attr_accessor :index
def initialize(encoding = nil, nssep = nil)
@save = false
@word = ""
@htmlfile = ""
end
def startElement(nsname, attr)
ns, name = nsname.split(";", 2)
if ns == $IDXNS && (name == "idx" || name == "hidx")
@save = true
@word = ""
elsif name == "page" && attr["name"]
@htmlfile = attr["name"]
end
end
def endElement(nsname)
ns, name = nsname.split(";", 2)
if ns == $IDXNS && (name == "idx" || name == "hidx")
@save = false
@word = @word.squeeze(" ").strip
@index[@word] ||= []
index[@word] << @htmlfile
end
end
def character(t)
@word << t if @save
end
end
This program works very fast, but it was a bitch to write and lots
longer, of course. For complex formats, SAX gets darn complicated,
trust me.
Finally, the elegant pull API (this implementation uses REXML’s Pull
API which has a bit weird API):
parser = REXML::Parsers::PullParser.new(io)
htmlfile = nil
parser.each { |event|
if event.start_element? && event[0] == "page"
htmlfile = event[1]["name"]
end
if event.start_element? && event[0] =~ /idx:h?idx/
word = parser.squeeze(' ').strip
(index[word] ||= []) << htmlfile
end
}
Since the REXML Pull API didn’t fully satisfy my needs, I decided to
rewrite a pull parser for ruby from scratch (Heck, I even spent a day
hacking it in pure Ruby, but soon gave up.), based on the Pull API
exposed by libxml2. I think above example will read something like
this when it’s done:
parser = FastPull.new(io)
htmlfile = nil
while parser.pull
if parser.start_element? && parser.name == "page"
htmlfile = parser.attributes["name"]
end
if parser.start_element? && parser.name =~ /{#{$IDXNS}}h?idx/
word = ""
while parser.pull.text?
word << parser.text
end
(index[word.squeeze(' ').strip] ||= []) << htmlfile
end
}
Now, you have the easiness of DOM and the speed of SAX. :-) (And I can
look at XML again.)
NP: Pearl Jam—Push Me, Pull Me