leah blogs

December 2005

05dec2005 · Tracing websites for fun (and profit?)

I recently stumbled on GotToZ.com, which really is a waste of time, but it looked like a challenge, so I tried it for a few minutes. The purpose of the site is to work your way through a tangled web of sites, each representing a letter of the alphabet, until you finally reach Z. I got up to Y manually, but then I decided Ruby could do that much better than me. I encourage you to try it manually first, though. (Not like you’re wasting enough time already…)

require 'open-uri'

@pages = {}
@count = Hash.new 0

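def track(page)
  # Fetch each page at most three times; the link to Z seems to show up at random.
  return  if @count[page] > 2
  @count[page] += 1

  a = (@pages[page] ||= [])
  # URI.open is needed on Ruby 3.0+, where Kernel#open no longer fetches URLs.
  URI.open(page).read.scan(
      /<A href="(http:\/\/.*?\.com\/)".*?>[A-Z]</m) { |e|
    a.push e.first
  }
  a.uniq!
  a.sort!
  STDERR.puts page
  a.each { |x| track x }
end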

track 'http://www.amongothers.com/'

puts "digraph {"
@pages.each { |k, v|
  v.each { |l| puts %Q[  "#{k}" -> "#{l}";] }
}
puts "}"

Run the script, possibly a few times because I think the Z site is added randomly (that’s why every page is fetched up to three times, too), and save the standard output to a file. Now you have a graph description you can run GraphViz on to make nifty diagrams, like this (click for full 3884x3434 view, be careful):

Circo graph of GotToZ.com
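If you want to poke at the idea without hitting the network, here is a minimal offline sketch of the same two steps: pull single-letter links out of HTML, then print the result as a GraphViz digraph. Everything in SAMPLE_PAGES (the example.com URLs and the markup) is made up for illustration and is not what the real sites serve; links_in, crawl and to_dot are hypothetical helper names. Unlike the script above, this version visits each page only once, since canned pages don't change between fetches.

```ruby
# Offline sketch: same link pattern and DOT output as the crawler above,
# but reading canned HTML from a hash instead of fetching URLs.
# SAMPLE_PAGES is invented test data, not real GotToZ content.
SAMPLE_PAGES = {
  'http://a.example.com/' =>
    '<A href="http://b.example.com/">B</A> <A href="http://c.example.com/">C</A>',
  'http://b.example.com/' =>
    '<A href="http://c.example.com/">C</A>',
  'http://c.example.com/' =>
    '<p>dead end</p>'
}.freeze

# A link to a .com root URL whose label starts with one capital letter.
LINK = %r{<A href="(http://.*?\.com/)".*?>[A-Z]<}m

# Return the sorted, de-duplicated link targets found in an HTML string.
def links_in(html)
  html.scan(LINK).map(&:first).uniq.sort
end

# Walk the canned pages breadth-first from start, visiting each page once,
# and return a hash of page => [linked pages].
def crawl(pages, start)
  graph = {}
  queue = [start]
  until queue.empty?
    page = queue.shift
    next if graph.key?(page)
    graph[page] = links_in(pages.fetch(page, ''))
    queue.concat(graph[page])
  end
  graph
end

# Render the page => links hash as a GraphViz digraph.
def to_dot(graph)
  lines = ['digraph {']
  graph.each do |from, targets|
    targets.each { |to| lines << %Q[  "#{from}" -> "#{to}";] }
  end
  lines << '}'
  lines.join("\n")
end

graph = crawl(SAMPLE_PAGES, 'http://a.example.com/')
puts to_dot(graph)
```

Saving that output to a file (say gottoz.dot, a name picked here for the example) and running something like `circo -Tpng gottoz.dot -o gottoz.png` should produce a small circular layout, assuming GraphViz is installed.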

NP: Dire Straits—Walk Of Life

Copyright © 2004–2022