Continued from part 2 of N…
After working around the hidden command-line ordering of mwdumper and going home while the database was still being loaded, I walked in today to find 4.88 million pages and counting. No categorylinks, no pagelinks, etc. I’m hoping that those get populated before the job is complete, because aside from the page data, we’re using Wikipedia categories to do a crude classification of entities, and pagelink data to construct a graph of pages and their connections.
Based on the schema documentation, I’ve constructed an ActiveRecord-based model of the tables that I’m interested in. Using ActiveRecord w/o Rails is pretty easy:
(1) You need to explicitly connect to the database. ActiveRecord maintains a copy of the connection to use for all derived classes once this is done:
require 'active_record'

ActiveRecord::Base.establish_connection(
  :adapter  => 'mysql',
  :host     => 'localhost',
  :database => 'wikipedia_test',
  :username => 'arun',
  :password => 'arun')
(2) You can then subclass the ActiveRecord::Base class as usual:
class Page < ActiveRecord::Base
  set_table_name 'page'
  set_primary_key 'page_id'
end
Pretty (yawn!) straightforward so far. So I decided to mix it up a little by writing my tests in rspec. I’ve been intrigued by Behavior Driven Development, aka BDD, and wanted to see how useful it would be in my day-to-day work. As I’ve said many times before, I’m a big fan of TDD. However, I still find it an effort to write tests first. When I do, it feels great. But I often slip into a rut where the tests don’t drive my coding as much as they should, and as a result I write unnecessary code.
Perhaps the biggest advantage of rspec over xUnit-style unit testing is that I get to specify my object behavior, prior to writing a single line of code for the object, in something that looks and feels like English, which is something I’m fairly fluent in. Here is the starter rspec I wrote prior to actually coding the Wikipedia model classes. At this point I was treating those classes as a single entity (aka ‘the model’):
describe Page do
  it 'should retrieve a valid page' do
  end
  it 'should retrieve a valid text' do
  end
  it 'should retrieve a valid revision' do
  end
  it 'should retrieve the latest text associated with a page via the associated revision' do
  end
  it 'should retrieve all associated CategoryLinks for a Page' do
  end
  it 'should retrieve all associated PageLinks for a Page' do
  end
end
This lets me really get my head around what the model should be capable of. Then, using the object extensions should and should_not, I can validate those assertions:
it 'should retrieve the latest text associated with a page via the associated revision' do
  test_text = "test text for page_1"
  @text = Text.new
  @text.attributes = load_text_attribs(test_text)
  @text.save
  @page = Page.new
  @page.attributes = load_page_attribs('page_1')
  @page.save
  @revision = Revision.new
  @revision.attributes = load_revision_attribs(@page.page_id, @text.old_id)
  @revision.save
  latest_rev = @page.get_latest_revision
  latest_rev.should_not eql(nil)
  text = latest_rev.text
  text.should_not eql(nil)
  text.old_text.should eql(test_text)
end
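The spec above walks a chain of lookups: page, then latest revision, then text. As a rough illustration of that chain, here is a plain-Ruby sketch that uses in-memory structs in place of the ActiveRecord models. The struct names and the two helper methods are hypothetical. The column names (rev_id, rev_page, rev_text_id, old_id, old_text) are taken from the MediaWiki schema, and treating the highest rev_id as the "latest" revision is an assumption made for this sketch, not necessarily how get_latest_revision works in the real model.

```ruby
# In-memory stand-ins for rows of the page, revision and text tables.
# These structs are hypothetical; the real models are ActiveRecord classes.
PageRow     = Struct.new(:page_id, :page_title)
RevisionRow = Struct.new(:rev_id, :rev_page, :rev_text_id)
TextRow     = Struct.new(:old_id, :old_text)

# Assumption: the "latest" revision of a page is the one with the highest rev_id.
# Returns nil when the page has no revisions.
def latest_revision(page, revisions)
  revisions.select { |rev| rev.rev_page == page.page_id }.max_by(&:rev_id)
end

# Follow the revision's rev_text_id to the matching text row (nil if none).
def text_for(revision, texts)
  return nil if revision.nil?
  texts.find { |t| t.old_id == revision.rev_text_id }
end

page  = PageRow.new(1, 'page_1')
texts = [TextRow.new(10, 'old text'), TextRow.new(11, 'test text for page_1')]
revisions = [
  RevisionRow.new(100, 1, 10),
  RevisionRow.new(101, 1, 11),
  RevisionRow.new(102, 2, 10) # belongs to a different page
]

latest = latest_revision(page, revisions)
puts text_for(latest, texts).old_text
```

In the real model the same traversal would presumably be expressed through ActiveRecord finders or associations rather than array scans; the sketch only shows which keys link the three tables.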
In order to get ActiveRecord Models loading in RSpec, I did a couple of things:
(1) I created a connection before running any tests. In rspec, the before(:all) method allows me to specify a block of code that runs prior to any test execution.
before(:all) do
  ActiveRecord::Base.establish_connection(
    :adapter  => 'mysql',
    :host     => 'localhost',
    :database => 'wikipedia_test',
    :username => 'arun',
    :password => 'arun')
end
(2) I made sure that data was getting deleted prior to every test by using the before() method with the :each symbol:
before(:each) do
#clear out all data
…
end
One significant difference between RSpec and Test::Unit is the lack of fixture support. I was shoving in fixture support, aka putting a square peg into a round hole, when I decided to see how RSpec users felt about fixtures. Turns out they are not particularly fond of them. One way to load test data w/o fixtures is to include model-specific helper modules that build the data:
module PageHelper
def load_page_attribs(title)
{
:page_title=>title,
:page_namespace=>2,
:page_random=> 0,
:page_touched=> 0,
:page_latest=> 0,
:page_len=> 200
}
end
end
…
page = Page.new
page.attributes = load_page_attribs('page_1')
page.save
In the end, even though it commingles data with code, which is commonly perceived as a bad thing to do (hence fixtures!), the less-code approach seems more manageable b/c I can make modifications directly in the spec file, and the use of Ruby symbols makes the code read as easily as a fixtures file.
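One way to keep those in-spec modifications small is to let the helper accept per-test overrides on top of its defaults. The sketch below is a hypothetical variant of PageHelper, not the helper actually used here: the DEFAULT_PAGE_ATTRIBS constant and the overrides parameter are both assumptions added for illustration.

```ruby
# Hypothetical variant of PageHelper: shared defaults plus per-spec overrides.
module PageHelper
  DEFAULT_PAGE_ATTRIBS = {
    :page_namespace => 2,
    :page_random    => 0,
    :page_touched   => 0,
    :page_latest    => 0,
    :page_len       => 200
  }.freeze

  # Returns a fresh attribute hash; keys in overrides replace the defaults.
  def load_page_attribs(title, overrides = {})
    DEFAULT_PAGE_ATTRIBS.merge(:page_title => title).merge(overrides)
  end
end

# Mixed into the top-level object here; a spec would include it in its
# example group instead.
include PageHelper

attribs = load_page_attribs('page_1', :page_len => 0)
puts attribs[:page_title]
puts attribs[:page_len]
```

Because merge returns a new hash each time, one spec's overrides cannot leak into another, which is part of what fixtures were meant to guarantee.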
More wikipedia processing progress, same bat time, same bat channel!