It is interesting that in the “real world”, most things go easier into one direction than in the other. For example, it is easier to break a coffee cup than to glue it together again. I think this has something to do with entropy.
Computer programs are not unaffected by this fact, take for example the two styles of on-the-fly XML parsing, Push vs. Pull. Translating a Pull API into a Pull API is done trivially, such code usually looks like that:
while parser.pull
case parser.state
when :open
receiver.open parser.text
when :text
receiver.text parser.text
when :close
receiver.close
end
end
Doing the reverse—converting a Push API into a Pull one—looks impossible at first, but that is not true. It can be done (albeit only in limited ways in Ruby, see below) using our beloved continuations.
I have written a small class, Stream2Pull
that does this crazy task:
class Stream2Pull
attr_reader :data
def initialize(&block)
@data = nil
@continuation = lambda { instance_eval &block }
end
def method_missing(*arguments)
@data = arguments
@again = true
switch
end
def pull
@again = false
switch
@again
end
def switch
save = @continuation
callcc { |@continuation| save.call }
save = nil
GC.start
end
end
Stream2Pull
takes a block it will call on itself on the first
execution of pull
. This code is expected to send undefined messages
to the instance. Then the “magic” happens: With the call of switch
,
the context of the program changes from method_missing
to the call of
pull
, which then returns. Now the caller can examine the undefined
message—the event—by looking at data
. Let’s wrap REXML’s
StreamParser
that way:
s_to_p = Stream2Pull.new {
REXML::Parsers::StreamParser.new(<<XML, self).parse
<?xml version="1.0"?>
<info>abc<em>hehe</em></info>
}
Now, we can use our new object just like a pull parser:
while s_to_p.pull
p s_to_p.data
end
We just repeatedly pull from the parser, thereby always switching
contexts between the inner loop of the StreamParser
and ours.
These are the events that are successfully emitted:
[:xmldecl, "1.0", nil, nil]
[:tag_start, "info", {}]
[:text, "abc"]
[:tag_start, "em", {}]
[:text, "hehe"]
[:tag_end, "em"]
[:tag_end, "info"]
[:text, "\n"]
We just have converted a Push API into a Pull API!
Now, we can talk about the dark side of the medal. Ruby’s
continuations are very slow and they eat loads of memory. When I
tried parsing a 1.5 megabyte file—the largest one I could find on my
disk, it is actually the XSL specification—I didn’t yet have the
GC.start
inside switch
. In about 2 minutes, my Ruby gobbled 400
megabytes of memory! After enforcing garbage-collection, it took less
memory but memory usage was still raising constantly. After 15
minutes and 100 megabytes used I gave up.
These are the things I really want to see in (I hope not too far away) future Ruby 2 implementations: Efficient tail calls, continuation passing style, and possibly support for partial continuations. If you implement CPS, you can get the others almost for free. Those features would make above code run at about the speed of a “native” pull parser. Alternatively, please port the Ruby standard library to Scheme. }}:-) Just kidding.
If you just want to parse XML, it’s all not too bad, actually. This
is because REXML’s StreamParser
*is* actually a wrapper around a
PullParser (REXML::BaseParser
)! (Of course, I knew that in advance.)
Therefore, we can easily write:
bp = REXML::Parsers::BaseParser.new(ABOVE_XML)
while bp.has_next?
p bp.pull
end
And it will generate exactly the same output as above. :-) The main difference is, well, it uses constant memory and only needs 15 seconds to parse the big XML document. And that’s not too bad for a pure Ruby implementation.
NP: Bob Marley—Three Little Birds