Installing Wikipedia part 4 of N: getting additional wikipedia metadata

January 30, 2008

Continued from yesterday:

The load of the ‘page’ table finished after approximately 16 hours on a dual-processor 2.2GHz machine with 4GB of RAM, yielding approximately 6.5 million page records along with the latest revision and text information (a similar number of records). All other tables were empty, which is fine if you want to host wikipedia, but not if you want to gather inter-wiki page links and category metadata, which are stored in the pagelinks and categorylinks tables respectively.

Page link data is useful because it provides a basic graph of all wikipedia nodes. Category link data is useful because it provides decent classification information without the overhead of classification methods that rely on raw text.
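To make the "basic graph" idea concrete, here is a minimal sketch of an adjacency list built from pagelinks-style rows. It assumes each row carries the linking page's id plus the target's namespace and title (the pl_from, pl_namespace and pl_title columns of the MediaWiki schema); the function name and the sample rows are made up for illustration.

```ruby
# Build an in-memory adjacency list from pagelinks-style rows.
# Each row is [pl_from, pl_namespace, pl_title]: pl_from is the id of the
# linking page, and the target is identified by namespace and title.
def build_link_graph(rows)
  graph = Hash.new { |hash, key| hash[key] = [] }
  rows.each do |pl_from, pl_namespace, pl_title|
    graph[pl_from] << [pl_namespace, pl_title]
  end
  graph
end

# Hypothetical sample rows, not taken from the real dump.
rows = [
  [1, 0, 'Ruby_(programming_language)'],
  [1, 0, 'MySQL'],
  [2, 0, 'MySQL']
]
graph = build_link_graph(rows)
puts graph[1].length
```

With tens of millions of links the full graph will probably not fit comfortably in a plain Ruby Hash, so treat this as a way to explore a subset of rows rather than the whole table.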

Where to get the metadata:

Categorylink data is available at

http://download.wikimedia.org/enwiki/20080103/enwiki-20080103-categorylinks.sql.gz

Pagelink data is available at

http://download.wikimedia.org/enwiki/20080103/enwiki-20080103-pagelinks.sql.gz

    I used wget and gzip -d to get the raw SQL data.
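If you would rather stay in Ruby for the decompression step, something like the following sketch should be roughly equivalent to gzip -d. It uses Zlib from the standard library; the function name gunzip_file and the chunk size are my own choices, not anything from the tools above.

```ruby
require 'zlib'

# Decompress a .gz file to dest_path in fixed-size chunks, so a
# multi-gigabyte dump never has to fit in memory at once.
def gunzip_file(gz_path, dest_path, chunk_size = 64 * 1024)
  Zlib::GzipReader.open(gz_path) do |gz|
    File.open(dest_path, 'wb') do |out|
      # GzipReader#read(length) returns nil once the stream is exhausted.
      while (chunk = gz.read(chunk_size))
        out.write(chunk)
      end
    end
  end
  dest_path
end

# Example call, assuming the dump has already been downloaded:
# gunzip_file('enwiki-20080103-pagelinks.sql.gz', 'enwiki-20080103-pagelinks.sql')
```

Unlike gzip -d, this leaves the original .gz file in place.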

    How to load the metadata:

    • mysql -u arun -parun wikipedia < enwiki-20080103-categorylinks.sql
    • mysql -u arun -parun wikipedia < enwiki-20080103-pagelinks.sql

    I ran into a problem when trying to load the categorylink sql: I received the following MySQL error:

    ERROR 1071 (42000) at line 12: Specified key was too long; max key length is 1000 bytes
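For context, here is a rough sketch of the arithmetic that can trigger this error. MySQL's utf8 character set reserves up to 3 bytes per character in an index, while latin1 needs only 1. The column widths below are assumptions for illustration (two VARCHAR columns and one INT in the same key), so check the CREATE TABLE statement in the dump for the real definitions.

```ruby
# Rough upper bound on an index key's size in bytes: each character column
# reserves chars * bytes_per_char, and fixed-width columns add their own size.
# Per-column length and NULL overhead bytes are ignored to keep this simple.
def index_key_bytes(char_widths, bytes_per_char, fixed_bytes = 0)
  char_widths.sum * bytes_per_char + fixed_bytes
end

max_key_bytes       = 1000       # limit quoted in the error message
assumed_char_widths = [255, 86]  # hypothetical VARCHAR widths, not read from the dump
assumed_int_bytes   = 4          # one INT column in the same key (assumption)

utf8_bytes   = index_key_bytes(assumed_char_widths, 3, assumed_int_bytes)
latin1_bytes = index_key_bytes(assumed_char_widths, 1, assumed_int_bytes)

puts "utf8:   #{utf8_bytes} bytes, fits: #{utf8_bytes <= max_key_bytes}"
puts "latin1: #{latin1_bytes} bytes, fits: #{latin1_bytes <= max_key_bytes}"
```

Under these assumed widths the utf8 key lands just over the limit and the latin1 key sits well under it, which would explain why changing the character set makes the error go away.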

    I googled around and found a thread that said the error was happening because the database was UTF-8, and that the fix was to switch it to Latin-1. When I did this from the mysql commandline client:

    alter database character set latin1;

    and reloaded the categorylink sql, it worked. Moral of the story: create your wikipedia database with Latin-1 encoding. If you are going to insert into a UTF-8 database, you will need to convert from Latin-1 to UTF-8. I’m using Ruby, so I’m going to use Iconv to convert into UTF-8 prior to inserting into my UTF-8 database.
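A minimal sketch of that conversion step follows. Note the swap: it uses String#encode, which is built into Ruby 1.9 and later, instead of Iconv, which was later removed from the standard library. The helper name latin1_to_utf8 and the sample title are hypothetical. It is also worth checking a few non-ASCII titles first: if the stored bytes are already valid UTF-8, this conversion would double-encode them.

```ruby
# Reinterpret raw bytes as Latin-1 (ISO-8859-1) and transcode them to UTF-8.
# This is a stand-in for the Iconv call, using core Ruby only.
def latin1_to_utf8(raw)
  raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

latin1_bytes = "Caf\xE9".b   # hypothetical title; 0xE9 is e-acute in Latin-1
utf8_title   = latin1_to_utf8(latin1_bytes)

puts utf8_title
puts utf8_title.encoding
```

The input string is duplicated before force_encoding so the caller's copy keeps its original encoding tag.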

    When I tried to confirm the number of loaded category/page links in the db, I received a strange internal MySQL error. I tried to restart the machine and got a ‘/var/lib/mysql: partition too full!’ message (and the database wouldn’t start up). I fixed this by deleting some data. If you can’t delete any data, try the steps suggested here.

    When I was able to restart the database, the pagelinks table was corrupted. I ran

    repair table pagelinks;

    from the mysql commandline client as described in the mysql documentation. The repair took about 3 hours, but the table is now intact.

    Wikipedia Download Summary:

    • Total Page Count: 6202531
    • Total Pagelink Count: 77444718
    • Total CategoryLink Count: 18912664

    Installing Wikipedia, part 3 of N, using ActiveRecord and RSpec to build the model layer

    January 30, 2008

    Continued from part 2 of N

    After working around the hidden commandline ordering of mwdumper and going home while the database was still being loaded, I walked in today to find 4.88 million pages and counting. No categorylinks, no pagelinks, etc. I’m hoping that those get populated before the job is complete, because aside from the page data, we’re using wikipedia categories to do a crude classification of entities, and pagelink data to construct a graph of pages and their connections.

    Based on the schema documentation, I’ve constructed an ActiveRecord based model of the tables that I’m interested in. Using ActiveRecord w/o Rails is pretty easy:

    (1) you need to explicitly connect to the database. ActiveRecord maintains a copy of the connection to use for all derived classes once this is done:

    ActiveRecord::Base.establish_connection(
      :adapter  => 'mysql',
      :host     => 'localhost',
      :database => 'wikipedia_test',
      :username => 'arun',
      :password => 'arun')

    (2) You can then subclass the ActiveRecord::Base class as usual:

    class Page < ActiveRecord::Base
      set_table_name 'page'
      set_primary_key 'page_id'
    end

    Pretty (yawn!) straightforward so far. So I decided to mix it up a little by writing my tests in rspec. I’ve been intrigued by Behavior Driven Development, aka BDD, and wanted to see how useful it would be in my day to day work. As I’ve said many times before, I’m a big fan of TDD. However, I still find it an effort to write tests first. When I do, it feels great. But I often slip into a rut where the tests don’t drive my coding as much as they should, and as a result I write unnecessary code.

    Perhaps the biggest advantage of rspec over xUnit-style unit testing is that I get to specify my object behavior, prior to writing a single line of code for the object, in something that looks and feels like English, which is something I’m fairly fluent in. Here is the starter rspec I wrote prior to actually coding the wikipedia model classes. I was treating those classes as a singular entity (aka ‘the model’) at this point:

    describe Page do

      it 'should retrieve a valid page' do
      end

      it 'should retrieve a valid text' do
      end

      it 'should retrieve a valid revision' do
      end

      it 'should retrieve the latest text associated with a page via the associated revision' do
      end

      it 'should retrieve all associated CategoryLinks for a Page' do
      end

      it 'should retrieve all associated PageLinks for a Page' do
      end

    end

    This allows me to really get my head around what the model should be capable of. Then, using the object extensions should and should_not, I can validate those assertions:

    it 'should retrieve the latest text associated with a page via the associated revision' do
      test_text = "test text for page_1"
      @text = Text.new

      @text.attributes = load_text_attribs(test_text)
      @text.save

      @page = Page.new
      @page.attributes = load_page_attribs('page_1')
      @page.save

      @revision = Revision.new
      @revision.attributes = load_revision_attribs(@page.page_id, @text.old_id)
      @revision.save

      latest_rev = @page.get_latest_revision
      latest_rev.should_not eql(nil)
      text = latest_rev.text
      text.should_not eql(nil)
      text.old_text.should eql(test_text)
    end
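The spec above leans on load_text_attribs and load_revision_attribs, which are not shown in this post. As a rough idea of what they might look like, here is a sketch in the same style as the page helper. The module names, the column names (assumed from the MediaWiki text and revision tables) and the default values are all guesses for illustration, not the actual helpers.

```ruby
# Hypothetical helpers; column names are assumed from the MediaWiki
# text and revision tables, and the default values are placeholders.
module TextHelper
  def load_text_attribs(text)
    {
      :old_text  => text,
      :old_flags => 'utf-8'
    }
  end
end

module RevisionHelper
  def load_revision_attribs(page_id, text_id)
    {
      :rev_page      => page_id,
      :rev_text_id   => text_id,
      :rev_comment   => '',
      :rev_user      => 0,
      :rev_user_text => 'spec',
      :rev_timestamp => '20080130000000'
    }
  end
end

# Mix both modules into a throwaway object to try them outside a spec.
helper = Object.new.extend(TextHelper, RevisionHelper)
p helper.load_revision_attribs(1, 2)
```

Inside a spec, the modules would be included into the describe block instead, so the helper methods can be called directly from each example.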

    In order to get ActiveRecord Models loading in RSpec, I did a couple of things:

    (1) I created a connection before running any tests. In rspec, the before(:all) method allows me to specify a block of code that runs prior to any test execution.

    before(:all) do
      ActiveRecord::Base.establish_connection(
        :adapter  => 'mysql',
        :host     => 'localhost',
        :database => 'wikipedia_test',
        :username => 'arun',
        :password => 'arun')
    end

    (2) I made sure that data was getting deleted prior to every test by using the before() method with the :each symbol:

    before(:each) do

    #clear out all data

    end

    One significant difference between RSpec and Test::Unit is the lack of fixture support. I was shoving in fixture support, aka putting a square peg into a round hole, when I decided to see how RSpec users felt about fixtures. Turns out they are not particularly fond of them. One way to load test data without fixtures is to include model-specific helper modules that build the attributes:

    module PageHelper

      # Returns a hash of attributes for a test page with the given title.
      def load_page_attribs(title)
        {
          :page_title     => title,
          :page_namespace => 2,
          :page_random    => 0,
          :page_touched   => 0,
          :page_latest    => 0,
          :page_len       => 200
        }
      end

    end

    page = Page.new
    page.attributes = load_page_attribs('page_1')
    page.save

    In the end, even though it co-mingles data with code, which is commonly perceived as a bad thing to do (hence fixtures!), the less-code approach seems more manageable because I can make modifications directly in the spec file, and the use of Ruby symbols makes the code read as easily as a fixtures file.

    More wikipedia processing progress, same bat time, same bat channel!