Playing a bit with Nukumi2, I discovered a very interesting thing: If I disable BlueCloth, my pages suddenly render a lot slower. “This ain’t possible”, I stumbled. BlueCloth is fairly fast, but still the slowest component of the rendering stage. “What the hell is going on?”
I timed the rendering of a few entries by themselves, they were all faster than with BlueCloth. Well, all except one. Yesterday’s. I trimmed it a bit, and finally found out that if I removed that part, it would run at full speed again:
class Class
def attr_inject(attribute)
(@__attributes_to_inject ||= []) << attribute
end
end
I ran a profiler (ruby -r profile
), and it turned out String#scan
was taking quite long… Where is it used? In RubyPants, look at the code:
def tokenize
tag_soup = /([^<]*)(<[^>]*>)/
tokens = []
prev_end = 0
scan(tag_soup) {
tokens << [:text, $1] if $1 != ""
tokens << [:tag, $2]
prev_end = $~.end(0)
}
if prev_end < size
tokens << [:text, self[prev_end..-1]]
end
tokens
end
Ok, it’s a part of the HTML tokenizer… The only reason scan
can be
slow is the regular expression, let’s have a closer look:
tag_soup = /([^<]*)(<[^>]*>)/
Well, it looks alright: First match any preceding text and then a tag. After toying around for fifteen minutes with that regexp, trying various tweaks (e.g. making it non-greedy), I built a proper test case. That string took very long, ~6 seconds to parse it 100 times:
<hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh...
Even more h
to come. I remembered a scary issue about
regular expressions: Under certain circumstances, backtracking can make
Regexps run in exponential time (various string sizes showed that)
because they try every possible match.
It really looked like that was the case… But where the hell does it backtrack? There are no alternations, and the match should run linearly. Well, it should. But it didn’t. It turned out that it really tries every possible match, because the match isn’t anchored.
D’oh! Of course, there is no anchor. Luckily, I remembered \G
from
back when I used Perl. man perlre
says:
\G Match only at pos() (e.g. at the end-of-match position
of prior m//g)
Hey, cool. That should be exactly what we need. Make that line simply
tag_soup = /\G([^<]*)(<[^>]*>)/
and off it goes! Blazingly fast. :-)
Who would have thought two characters can make such a big difference?
(Of course, I’d have never discovered that “bug” if I only ran it on
correct input (as >>
is not valid HTML), so test your code with
“invalid” data too!)
NP: Sum 41—We’re All To Blame