Installing Wikipedia, part 3 of N, using ActiveRecord and RSpec to build the model layer

Continued from part 2 of N

After working around the hidden command-line ordering of mwdumper and going home while the database was still being loaded, I walked in today to find 4.88 million pages and counting. No categorylinks, no pagelinks, etc. I’m hoping that those get populated before the job is complete, because aside from the page data, we’re using Wikipedia categories to do a crude classification of entities, and pagelink data to construct a graph of pages and their connections.

Based on the schema documentation, I’ve constructed an ActiveRecord-based model of the tables that I’m interested in. Using ActiveRecord w/o Rails is pretty easy:

(1) You need to explicitly connect to the database. Once this is done, ActiveRecord maintains a copy of the connection to use for all derived classes:

require 'active_record'

ActiveRecord::Base.establish_connection(
  :adapter  => 'mysql',
  :host     => 'localhost',
  :database => 'wikipedia_test',
  :username => 'arun',
  :password => 'arun')

(2) You can then subclass the ActiveRecord::Base class as usual:

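class Page < ActiveRecord::Base
  # ActiveRecord 4 and later removed these setters; there you would write
  # self.table_name = 'page' and self.primary_key = 'page_id' instead.
  set_table_name 'page'
  set_primary_key 'page_id'
end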
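The same connection settings show up again in the spec setup later in this post. A minimal sketch of one way to keep them in a single place follows. The helper name wikipedia_db_config and the WIKI_DB_* environment variable names are made up for illustration and are not part of ActiveRecord or of the original code.

```ruby
# Hypothetical helper (not from the original post): builds the settings hash
# that gets passed to ActiveRecord::Base.establish_connection. Values come
# from environment variables when set, otherwise from the post's defaults.
def wikipedia_db_config(env = ENV)
  {
    :adapter  => env.fetch('WIKI_DB_ADAPTER', 'mysql'),
    :host     => env.fetch('WIKI_DB_HOST', 'localhost'),
    :database => env.fetch('WIKI_DB_NAME', 'wikipedia_test'),
    :username => env.fetch('WIKI_DB_USER', 'arun'),
    :password => env.fetch('WIKI_DB_PASSWORD', 'arun')
  }
end

# Usage (needs the activerecord gem and a reachable database):
#   ActiveRecord::Base.establish_connection(wikipedia_db_config)
```

Passing a plain Hash instead of ENV (both respond to fetch with a default) makes the helper easy to exercise without touching the real environment.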

Pretty (yawn!) straightforward so far. So I decided to mix it up a little by writing my tests in RSpec. I’ve been intrigued by Behavior Driven Development, aka BDD, and wanted to see how useful it would be in my day to day work. As I’ve said many times before, I’m a big fan of TDD. However, I still find it an effort to write tests first. When I do, it feels great. But I often slip into a rut where the tests don’t drive my coding as much as they should, and as a result I write unnecessary code.

Perhaps the biggest advantage of RSpec over xUnit-style unit testing is that I get to specify my object behavior prior to writing a single line of code for the object, in something that looks and feels like English, which is something I’m fairly fluent in. Here is the starter spec I wrote prior to actually coding the Wikipedia model classes. I was treating those classes as a singular entity (aka ‘the model’) at this point:

describe Page do

  it 'should retrieve a valid page' do
  end

  it 'should retrieve a valid text' do
  end

  it 'should retrieve a valid revision' do
  end

  it 'should retrieve the latest text associated with a page via the associated revision' do
  end

  it 'should retrieve all associated CategoryLinks for a Page' do
  end

  it 'should retrieve all associated PageLinks for a Page' do
  end

end

This allows me to really get my head around what the model should be capable of. Then, using the object extensions should and should_not, I can validate those assertions:

it 'should retrieve the latest text associated with a page via the associated revision' do
  test_text = "test text for page_1"
  @text = Text.new

  @text.attributes = load_text_attribs(test_text)
  @text.save

  @page = Page.new
  @page.attributes = load_page_attribs('page_1')
  @page.save

  @revision = Revision.new
  @revision.attributes = load_revision_attribs(@page.page_id, @text.old_id)
  @revision.save

  latest_rev = @page.get_latest_revision
  latest_rev.should_not eql(nil)
  text = latest_rev.text
  text.should_not eql(nil)
  text.old_text.should eql(test_text)
end
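The spec above walks a chain: page to revision to text. The post does not show get_latest_revision itself, so what follows is only a plain-Ruby sketch of the lookup, with in-memory stand-ins instead of ActiveRecord classes. It assumes "latest" means the highest rev_id among a page's revisions, and that a revision points at its text through rev_text_id, which matches the arguments passed to load_revision_attribs. The WikiSketch module, Store class, and *Row structs are all invented for illustration.

```ruby
# Plain-Ruby stand-ins for the page, revision, and text tables. Column names
# follow the MediaWiki schema; the in-memory store is illustrative only.
module WikiSketch
  TextRow     = Struct.new(:old_id, :old_text)
  RevisionRow = Struct.new(:rev_id, :rev_page, :rev_text_id)
  PageRow     = Struct.new(:page_id, :page_title)

  class Store
    def initialize
      @texts = {}      # old_id => TextRow
      @revisions = []  # every RevisionRow added so far
    end

    def add_text(text)
      @texts[text.old_id] = text
    end

    def add_revision(revision)
      @revisions << revision
    end

    # Assumption: "latest" is the revision with the highest rev_id for the page.
    # Returns nil when the page has no revisions.
    def latest_revision(page)
      @revisions.select { |r| r.rev_page == page.page_id }.max_by(&:rev_id)
    end

    # Follows revision.rev_text_id to text.old_id; nil when nothing is found.
    def latest_text(page)
      revision = latest_revision(page)
      revision && @texts[revision.rev_text_id]
    end
  end
end

store = WikiSketch::Store.new
page  = WikiSketch::PageRow.new(1, 'page_1')
store.add_text(WikiSketch::TextRow.new(10, 'older text'))
store.add_text(WikiSketch::TextRow.new(11, 'test text for page_1'))
store.add_revision(WikiSketch::RevisionRow.new(100, 1, 10))
store.add_revision(WikiSketch::RevisionRow.new(101, 1, 11))
puts store.latest_text(page).old_text
```

In the real model the same two hops would be database queries. The sketch only makes explicit which columns join which tables.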

In order to get ActiveRecord models loading in RSpec, I did a couple of things:

(1) I created a connection before running any tests. In RSpec, the before(:all) method allows me to specify a block of code that runs once, prior to any test execution.

before(:all) do
  ActiveRecord::Base.establish_connection(
    :adapter  => 'mysql',
    :host     => 'localhost',
    :database => 'wikipedia_test',
    :username => 'arun',
    :password => 'arun')
end

(2) I made sure that data was getting deleted prior to every test by using the before() method with the :each symbol:

before(:each) do
  # clear out all data
end

One significant difference between RSpec and Test::Unit is the lack of fixture support. I was shoving in fixture support, aka putting a square peg into a round hole, when I decided to see how RSpec users felt about fixtures. Turns out they are not particularly fond of them. One way to load test data w/o fixtures is to include model-specific helper modules that build the attributes:

module PageHelper

  def load_page_attribs(title)
    {
      :page_title     => title,
      :page_namespace => 2,
      :page_random    => 0,
      :page_touched   => 0,
      :page_latest    => 0,
      :page_len       => 200
    }
  end

end

page = Page.new
page.attributes = load_page_attribs('page_1')
page.save
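A helper like this tends to grow once individual specs need slightly different data. One possible extension, sketched here under the assumption that per-spec overrides are wanted, merges a hash of overrides on top of the defaults. The overrides parameter and the PAGE_DEFAULTS constant are hypothetical additions, not part of the original helper.

```ruby
module PageHelper
  # Defaults mirror the values used in load_page_attribs above.
  PAGE_DEFAULTS = {
    :page_namespace => 2,
    :page_random    => 0,
    :page_touched   => 0,
    :page_latest    => 0,
    :page_len       => 200
  }.freeze

  # `overrides` is a hypothetical addition: any key given here replaces
  # the corresponding default for this one call only.
  def load_page_attribs(title, overrides = {})
    PAGE_DEFAULTS.merge(:page_title => title).merge(overrides)
  end
end

# Stand-in for a spec context that has included the helper.
helper  = Object.new.extend(PageHelper)
attribs = helper.load_page_attribs('page_1', :page_len => 0)
puts attribs.inspect
```

Because Hash#merge returns a new hash, the frozen defaults are never modified, so one spec’s overrides cannot leak into the next.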

In the end, even though it commingles data with code, which is commonly perceived as a bad thing to do (hence fixtures!), the less-code approach seems more manageable b/c I can make the modification directly in the spec file, and the use of Ruby symbols makes the code read as easily as a fixtures file.

More Wikipedia processing progress, same bat time, same bat channel!
