Learned a little bit about scRUBYt! today (damn, that's hard to type!).

It looks like a pretty cool way to write web crawlers. To use it, you define a "learning" crawler using the scRUBYt! DSL. scRUBYt! combines that definition with the actual site, as it looks at that point in time, to create the real crawler that does the dirty work. Essentially, example text gets converted to XPath expressions. The cool thing about this is that when the site changes, you only have to make minimal changes to generate a new working crawler, not to mention that creating the crawler in the first place is a lot easier.
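
To make the "example text gets converted to XPath" idea concrete, here is a minimal sketch of the concept using REXML from Ruby's standard distribution. This is not scRUBYt!'s actual implementation: the sample page, the `learn_xpath` helper and the choice to generalize by dropping position indices are all assumptions made for illustration.

```ruby
require 'rexml/document'

# Hypothetical, well-formed page standing in for a search results page.
SAMPLE_PAGE = <<~PAGE
  <html>
    <body>
      <div>
        <div><a href="http://www.ruby-lang.org/">Ruby Programming Language</a></div>
        <div><a href="http://www.rubyonrails.org/">Ruby on Rails</a></div>
      </div>
    </body>
  </html>
PAGE

# Hypothetical helper: find the first element whose text equals the example,
# then walk up to the root collecting element names. Leaving out position
# indices is what lets the path match sibling results as well.
def learn_xpath(doc, example_text)
  node = REXML::XPath.match(doc, '//*').find { |e| e.text == example_text }
  return nil unless node

  names = []
  until node.nil? || node.is_a?(REXML::Document)
    names.unshift(node.name)
    node = node.parent
  end
  '/' + names.join('/')
end

doc = REXML::Document.new(SAMPLE_PAGE)

# "Learning" step: one example string in, a generalized XPath out.
xpath = learn_xpath(doc, 'Ruby Programming Language')

# "Production" step: the learned XPath picks up every matching link.
urls = REXML::XPath.match(doc, xpath).map { |a| a.attributes['href'] }

puts xpath
puts urls
```

The learning step only needs one example string, but the path it produces also matches the second link, which is the same effect the learning/production split above relies on.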



The examples on the site are a little out of date, but here is a simple one that shows the basics. The learning crawler learns how to extract URLs from a Google search for 'ruby', and the production version it generates can then extract the URLs for any search term. It also automatically crawls to the next page by looking for the 'Next' link.

The learning crawler that you write:

require 'rubygems'
require 'scrubyt'


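google_data = Scrubyt::Extractor.define do
  # Load the search page, fill in the query field named 'q' and submit the form
  fetch 'http://www.google.com/ncr'
  fill_textfield 'q', 'ruby'
  submit

  # Example text of one result link; "/@href" asks for its href attribute
  link "Ruby Programming Language/@href"
  # Keep following the 'Next' link, up to the given limit
  next_page "Next", :limit => 2
end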

puts google_data.to_xml

Which becomes this for production use:


require 'rubygems'
require 'scrubyt'

google_data = Scrubyt::Extractor.define do
  fetch "http://www.google.com/ncr"
  fill_textfield "q", "ruby"
  submit

  link "/html/body/div/div/div/a"

  next_page "Next", :limit => 2
end

google_data.to_xml.write($stdout, 1)


And outputs:


http://www.ruby-lang.org/
http://www.ruby-lang.org/en/20020101.html
http://en.wikipedia.org/wiki/Ruby_programming_language
http://en.wikipedia.org/wiki/Ruby
http://www.rubyonrails.org/
http://www.rubycentral.com/
http://www.rubycentral.com/book/
http://www.youtube.com/watch?v=JMDcOViViNY
http://www.w3.org/TR/ruby/
http://www.zenspider.com/Languages/Ruby/QuickRef.html
http://poignantguide.net/
http://www.rubynz.com/
http://www.ruby-doc.org/
http://tryruby.hobix.com/
http://www.rubycentral.org/
http://www.gemstone.org/gem-by-gem/english/ruby.html
http://whytheluckystiff.net/ruby/pickaxe/
http://intertwingly.net/blog/
http://lotusmedia.org/
http://www.oreillynet.com/ruby/
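
The 'Next' link paging can be pictured as a simple loop. The sketch below is plain Ruby over a hypothetical in-memory "site" (the `PAGES` hash and the `crawl` method are made up for illustration and are not part of scRUBYt!). It assumes the limit caps the total number of pages visited, which is at least consistent with the 20 URLs listed above, i.e. two pages of ten results.

```ruby
# Hypothetical in-memory site: each page has result links and an optional
# pointer to the page its 'Next' link leads to.
PAGES = {
  'page1' => { links: ['http://a.example/', 'http://b.example/'], next: 'page2' },
  'page2' => { links: ['http://c.example/'], next: 'page3' },
  'page3' => { links: ['http://d.example/'], next: nil }
}

# Collect links page by page, following 'Next' until there is no next page
# or `limit` pages have been visited (assumed meaning of the limit).
def crawl(pages, start, limit:)
  results = []
  current = start
  visited = 0
  while current && visited < limit
    page = pages.fetch(current)
    results.concat(page[:links])
    visited += 1
    current = page[:next]
  end
  results
end

urls = crawl(PAGES, 'page1', limit: 2)
puts urls
```

With `limit: 2` only the first two pages are visited. A larger limit simply stops when a page has no 'Next' link.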




One of the authors showed me a good example which I don't think is linked to from anywhere yet, so here it is in case you are interested.

We'll see if this lives up to its promise in real-world usage, but it's looking good so far. One major drawback is that it breaks on heavy AJAX sites, e.g. when clicking a paging link loads the next page in place using an AJAX call. However, one can usually find a non-AJAX workaround to get to the same data. The authors are aware of this and are planning to integrate something like FireWatir so that it can handle AJAX/DHTML requests seamlessly.