Installing Wikipedia, part 4 of N: getting additional wikipedia metadata

January 30, 2008

Continued from yesterday:

Loading the ‘page’ table data finished after approximately 16 hours on a 2.2GHz dual-proc, 4GB machine, with approximately 6.5 million page records, along with the latest revision and text information (a similar number of records). All other tables were blank — which is fine if you just want to host wikipedia, but not fine if you want to gather internal page links and category metadata, which are stored in the pagelinks and categorylinks tables respectively.

Page link data is useful because it provides a basic graph of all wikipedia nodes. Category link data is useful because it provides decent classification information w/o the overhead of classification methods that rely on raw text.

Where to get the metadata:

Pagelink data is available at

http://download.wikimedia.org/enwiki/20080103/enwiki-20080103-pagelinks.sql.gz

Categorylink data is available at

http://download.wikimedia.org/enwiki/20080103/enwiki-20080103-categorylinks.sql.gz

    I used wget and gzip -d to get the raw SQL data.

    How to load the metadata:

    • mysql -u arun -parun wikipedia < enwiki-20080103-categorylinks.sql
    • mysql -u arun -parun wikipedia < enwiki-20080103-pagelinks.sql

    I ran into a problem when trying to load the categorylink sql: I received the following MySQL error:

    ERROR 1071 (42000) at line 12: Specified key was too long; max key length is 1000 bytes

    I googled around and found a thread that said the error was happening because the database was UTF-8, and that the fix was to switch it to Latin-1. When I did this from the mysql commandline client:

alter database wikipedia character set latin1;

and reloaded the categorylink sql, it worked. Moral of the story: create your wikipedia database with Latin-1 encoding. If you later want the data in a UTF-8 database, you will need to convert it from Latin-1 to UTF-8. I’m using Ruby, so I’m going to use Iconv to convert to UTF-8 prior to inserting into my UTF-8 database.
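
As a minimal sketch of what that conversion will look like — the row hash here is just a placeholder for however the Latin-1 rows get fetched, not settled code:

require 'iconv'

# convert a Latin-1 string pulled out of the wikipedia database into UTF-8
# before inserting it into my UTF-8 encoded database
def to_utf8(latin1_string)
  Iconv.conv('UTF-8', 'ISO-8859-1', latin1_string)
end

utf8_category = to_utf8(row['cl_to'])  # cl_to holds the category name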

One thing that happened when I tried to confirm the number of loaded category/page links in the db was that I received a strange internal MySQL error. I tried to restart the machine and got a ‘/var/lib/mysql: partition too full!’ message (and the database wouldn’t start up). I fixed this by deleting some data. If you can’t delete any data, try the steps suggested here.

    When I was able to restart the database, the pagelinks table was corrupted. I ran

    repair table pagelinks;

from the mysql commandline client, as described in the mysql documentation. The repair took about 3 hours, but the table is now usable.

    Wikipedia Download Summary:

    • Total Page Count: 6202531
    • Total Pagelink Count: 77444718
    • Total CategoryLink Count: 18912664
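
For what it’s worth, those numbers are just row counts; with the ActiveRecord models described in part 3 below, something like the following reproduces them (assuming PageLink and CategoryLink models pointed at the pagelinks and categorylinks tables, set up the same way as Page):

# row counts via ActiveRecord -- PageLink and CategoryLink are assumed
# to be wired to the pagelinks and categorylinks tables
puts "pages:          #{Page.count}"
puts "page links:     #{PageLink.count}"
puts "category links: #{CategoryLink.count}"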

    Installing Wikipedia, part 3 of N, using ActiveRecord and RSpec to build the model layer

    January 30, 2008

    Continued from part 2 of N

    After working around the hidden commandline ordering of mwdumper and going home while the database was still being loaded, I walked in today to find 4.88 million pages and counting. No categorylinks, no pagelinks, etc. I’m hoping that those get populated before the job is complete, because aside from the page data, we’re using wikipedia categories to do a crude classification of entities, and pagelink data to construct a graph of pages and their connections.

    Based on the schema documentation, I’ve constructed an ActiveRecord based model of the tables that I’m interested in. Using ActiveRecord w/o Rails is pretty easy:

    (1) you need to explicitly connect to the database. ActiveRecord maintains a copy of the connection to use for all derived classes once this is done:

ActiveRecord::Base.establish_connection(
  :adapter  => 'mysql',
  :host     => 'localhost',
  :database => 'wikipedia_test',
  :username => 'arun',
  :password => 'arun')

    (2) You can then subclass the ActiveRecord::Base class as usual:

class Page < ActiveRecord::Base
  set_table_name 'page'
  set_primary_key 'page_id'
end
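
To round out the picture, here is a rough sketch of the remaining model classes and the associations between them, based on the schema documentation — the foreign keys and the get_latest_revision implementation are my current best guesses, not settled code:

class Text < ActiveRecord::Base
  set_table_name 'text'
  set_primary_key 'old_id'
end

class Revision < ActiveRecord::Base
  set_table_name 'revision'
  set_primary_key 'rev_id'
  belongs_to :page, :foreign_key => 'rev_page'
  belongs_to :text, :foreign_key => 'rev_text_id'
end

# categorylinks and pagelinks have no single-column primary key,
# which is fine for read-only access
class CategoryLink < ActiveRecord::Base
  set_table_name 'categorylinks'
  belongs_to :page, :foreign_key => 'cl_from'
end

class PageLink < ActiveRecord::Base
  set_table_name 'pagelinks'
  belongs_to :page, :foreign_key => 'pl_from'
end

# re-open Page to add the associations
class Page < ActiveRecord::Base
  has_many :revisions, :foreign_key => 'rev_page'
  has_many :category_links, :class_name => 'CategoryLink', :foreign_key => 'cl_from'
  has_many :page_links, :class_name => 'PageLink', :foreign_key => 'pl_from'

  # the most recent revision loaded for this page
  def get_latest_revision
    revisions.find(:first, :order => 'rev_id DESC')
  end
end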

Pretty (yawn!) straightforward so far. So I decided to mix it up a little by writing my tests in rspec. I’ve been intrigued by Behavior Driven Development, aka BDD, and wanted to see how useful it would be in my day to day work. As I’ve said many times before, I’m a big fan of TDD. However, I still find it an effort to write tests first. When I do, it feels great. But I often slip into a rut where the tests don’t drive my coding as much as they should, and as a result I write unnecessary code.

Perhaps the biggest advantage of rspec over xUnit-style unit testing is that I get to specify my object’s behavior, prior to writing a single line of code for it, in something that looks and feels like English — a language I’m fairly fluent in. Here is the starter spec I wrote prior to actually coding the wikipedia model classes; at this point I was treating those classes as a single entity (aka ‘the model’):

describe Page do

  it 'should retrieve a valid page' do
  end

  it 'should retrieve a valid text' do
  end

  it 'should retrieve a valid revision' do
  end

  it 'should retrieve the latest text associated with a page via the associated revision' do
  end

  it 'should retrieve all associated CategoryLinks for a Page' do
  end

  it 'should retrieve all associated PageLinks for a Page' do
  end

end

This allows me to really get my head around what the model should be capable of. Then, using the object extensions should and should_not, I can validate those assertions:

it 'should retrieve the latest text associated with a page via the associated revision' do
  test_text = "test text for page_1"

  @text = Text.new
  @text.attributes = load_text_attribs(test_text)
  @text.save

  @page = Page.new
  @page.attributes = load_page_attribs('page_1')
  @page.save

  @revision = Revision.new
  @revision.attributes = load_revision_attribs(@page.page_id, @text.old_id)
  @revision.save

  latest_rev = @page.get_latest_revision
  latest_rev.should_not eql(nil)
  text = latest_rev.text
  text.should_not eql(nil)
  text.old_text.should eql(test_text)
end
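
The load_*_attribs calls above are helper methods that just build attribute hashes; load_page_attribs is shown further down, and the Text and Revision versions look roughly like this (a sketch — the non-essential column values are arbitrary filler):

module TextHelper
  def load_text_attribs(text)
    {
      :old_text  => text,
      :old_flags => 'utf-8'
    }
  end
end

module RevisionHelper
  # ties a revision to an existing page and text record
  def load_revision_attribs(page_id, text_id)
    {
      :rev_page       => page_id,
      :rev_text_id    => text_id,
      :rev_comment    => 'test revision',
      :rev_user       => 0,
      :rev_user_text  => 'arun',
      :rev_timestamp  => '20080130000000',
      :rev_minor_edit => 0,
      :rev_deleted    => 0,
      :rev_len        => 200
    }
  end
end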

    In order to get ActiveRecord Models loading in RSpec, I did a couple of things:

    (1) I created a connection before running any tests. In rspec, the before(:all) method allows me to specify a block of code that runs prior to any test execution.

before(:all) do
  ActiveRecord::Base.establish_connection(
    :adapter  => 'mysql',
    :host     => 'localhost',
    :database => 'wikipedia_test',
    :username => 'arun',
    :password => 'arun')
end

    (2) I made sure that data was getting deleted prior to every test by using the before() method with the :each symbol:

before(:each) do
  # clear out all data so each example starts with an empty database
  Revision.delete_all
  Text.delete_all
  Page.delete_all
end

One significant difference between RSpec and Test::Unit is the lack of fixture support. I was busy shoving fixture support in, aka putting a square peg into a round hole, when I decided to see how RSpec users felt about fixtures. Turns out they are not particularly fond of them. One way to load test data w/o fixtures is to include model-specific helper modules that build the data.

module PageHelper

  def load_page_attribs(title)
    {
      :page_title     => title,
      :page_namespace => 2,
      :page_random    => 0,
      :page_touched   => 0,
      :page_latest    => 0,
      :page_len       => 200
    }
  end

end

page = Page.new
page.attributes = load_page_attribs('page_1')
page.save
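
For those helper methods to be visible inside the examples, the helper modules get mixed into the example group, roughly like so:

describe Page do
  include PageHelper
  include TextHelper
  include RevisionHelper

  # ... before blocks and examples go here
end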

In the end, even though it co-mingles data with code — which is commonly perceived as a bad thing to do, hence fixtures! — the less-code approach seems more manageable b/c I can make modifications directly in the spec file, and the use of Ruby symbols makes the code read as easily as a fixtures file.

    More wikipedia processing progress, same bat time, same bat channel!


    Installing Wikipedia, part 2 of N: getting the !@?! data.

    January 29, 2008

    Continued from Part 1 of N

Where to start… First of all, the high-level steps:

1. Download the dump file
2. Convert it to a huge, uncompressed SQL file
3. Import that file into MySQL

    Downloading the Dump File

Potential (Dis)Qualifier: I’m doing this on a debian box, assuming the standard tools available on that distro. You can find the (english) file dumps at http://download.wikimedia.org/enwiki/; the subdirectories correspond to the dates of the dumps. There are several compressed (bz2) XML files and several compressed (gzip) SQL files. Right now I don’t know the degree of overlap between the two, but I have downloaded the latest pages-articles.xml.bz2, mainly because I didn’t want the extra user/discussion information found in pages-meta-current.xml.bz2.

    Here are the steps I took to download and extract the SQL that I then loaded into MySQL:

    wget http://download.wikimedia.org/enwiki/20080103/enwiki-20080103-pages-articles.xml.bz2

    Then I ran

    bzip2 -d enwiki-20080103-pages-articles.xml.bz2

    to uncompress to a huge (14GB) xml file.

    How to Convert that Bad Boy into SQL

    There are many ways to do this, detailed on this page. One thing I hate is 10 different ways to do the same thing, so I did some research and found that most people have had success using mwdumper. I downloaded the latest mwdumper.jar from http://download.wikimedia.org/tools/mwdumper.jar, and ran it as follows:

java -jar mwdumper.jar enwiki-20080103-pages-articles.xml --format=sql:1.5 > dump.sql

Note the order of the parameters. It is important to list the file you are dumping from prior to the format; when I didn’t do this, I received a bizarre SQL insertion error:

    ERROR 1064 (42000) at line 1: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ‘ …

    Loading into MySQL

The MWDumper execution line above can be used with a pipe to route data directly into MySQL. I tried this at first, but ran into the parameter ordering problem described above and backed off to creating a real SQL file before I figured out what was going on. I think breaking XML->SQL conversion and SQL insertion into separate steps is a good idea in general, because I haven’t gotten all of the SQL in clean yet, and having that file enables me to make multiple passes.

There are a couple of changes to the mysql config file (located at /etc/mysql/my.cnf) that improved the speed of the upload:

(1) Increase the size of the InnoDB log file and log buffer; the log file defaults to 5MB, which means lots of disk I/O. Here are the settings I used — note that these are conservative because I don’t have exclusive use of all 4GB of my machine’s RAM.

    innodb_log_file_size=20M
    innodb_log_buffer_size=8M

(2) Turn off log_bin — logging transactions is a great way to make sure you don’t lose data, but I’m uploading to a single machine and don’t need the overhead. I comment out default settings as a general rule:

    #log_bin = /var/log/mysql/mysql-bin.log
    # WARNING: Using expire_logs_days without bin_log crashes the server! See README.Debian!
    #expire_logs_days = 10
    #max_binlog_size = 100M

    After making these changes, I restarted mysql:

    /etc/init.d/mysql restart

    Then I created a UTF-8 encoded database to download into using the mysql commandline:

mysql -u username -ppassword (note that -p only accepts the password if there is no space between the flag and the password)

    Once in the mysql commandline client:

    create database wikipedia CHARACTER SET utf8; (make sure you put the semicolon on the end)

Now leave the client (\q), because we need to load the database with a schema, otherwise the SQL load won’t work — the dump is just a set of insert statements that assumes the tables already exist. Download the latest schema from

    http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/maintenance/tables.sql?revision=29128

— this is the head revision at the time I wrote this; visit the mediawiki viewvc page to get the latest version. This MediaWiki page provides a great overview of the schema.

    You can insert the schema into the database by doing:

    mysql -u username -ppassword wikipedia < tables.sql

    Now, with the wikipedia database schema set up, you can insert data:

    mysql -u username -ppassword wikipedia < dump.sql (assuming dump.sql is the name of the sql file you generated using MWDumper)

Now it’s time to step away from the machine — I ran this job under nohup so I could get on with my life. It’s still running, and I’ll update the blog with the results in my next wikipedia-related post.


    Installing Wikipedia, part 1 of N

    January 28, 2008

    One of my mandates at work is to build up a huge repository of information taken from free and not free, structured and unstructured data sources. Naturally we targeted Wikipedia as the best seed source to get us started. After writing some very specialized table scrapers in Ruby, we decided to take another approach and just ingest the entire Wikipedia database.

    Wikipedia has a “don’t scrape our pages, just download our file dumps” policy, which makes complete sense. Because of this (and the IP blacklisting that occurs when you scrape too fast/too much), our goal is to harvest information from an internal Wikipedia site.

Wikipedia provides a set of tools/instructions to install a version of Wikipedia locally, starting with processing the dump into MySQL and then installing/running MediaWiki, the software that converts the contents of the database into displayable and editable wiki content.

That is the theory, at least. Here is what I’ve found while trying to follow these instructions. Note that some, if not most, of the drama below is self-manufactured, but hopefully someone, somewhere will find this information useful.

    1. There is too much information — i.e. six different ways to do the same thing — in some places, and not enough — i.e. what does an actual XML dump contain vs the SQL dumps — in others.
    2. As with most batch processes, error handling and recovery are painful. I am going to share my ‘lessons learned’ wrt wikipedia tools in a future post.
3. There is no mapping of SQL to XML files in the download dirs — this may only be a problem because I was unable to successfully translate all of the XML into SQL, but it would still be nice to see that pages.sql contains text and revisions (because it’s not obvious that the other SQL files contain any of that information).
4. There are plenty of ways to go off into the weeds. For example, I just spent a 1/2 hour debugging why my sql upload failed at 68K entries. Turns out that the mwdumper output was getting folded into the sql file because I was running the job under nohup. For now I’m going to run it as a console job, and investigate how to separate stdout from stderr under nohup when I import it into the database.

In addition, I’m also using ActiveRecord to access/manipulate the data once it (eventually) gets into the database. ActiveRecord w/o Rails merits some discussion, primarily because I’m also using it with RSpec, which I believe is a much more natural way to validate code.

      I’m going to update my daily progress here so that I don’t have to go through this again. Stay tuned!


      So much to do, so little time…

      January 26, 2008

      I’m in (geek) lust after reading over the sample chapter of Making Things Work, a book that details how to DIY a set of interconnected devices that you get to make from scratch. This book touches on the three things that made me fall in love with computers from the first time I touched a mainframe at age 10:

      (1) designing really fun devices (both virtual and physical)

      (2) making them talk to one another

      (3) creating the detailed interactions (and watching the ensuing hilarity).

      Since I haven’t even started to make progress on my exercise tracking/mapping/etc website, I’m going to have to hold off on this one, maybe put it out there as the carrot that will get me through the current home project, which I can only work on when my real job isn’t demanding all of my time, and right now that isn’t the case 😦


      My Early Morning Exercise Commandments

      January 21, 2008

Tomorrow, at 5:45, I will get up and prepare to go on a nine mile run. It’s the only time of the day, between job and family, that I have to run, and while I’m not a super hardcore kind of guy — you know, the kind they show in the Nike commercials — going on a long run or ride in the morning actually helps the rest of the day go much better.

      Over the years I’ve found that the following suggestions are the only things that keep me rolling out of a warm and very comfortable bed:

      1. I give myself bonus points just for being awake, let alone awake and running.  That means I spend a lot of time immediately after I wake up  thinking “way to go! way to be awake! way to be stepping out into the cold and wet!”
2. Those bonus points mean that I don’t have to run fast if I don’t feel like it. Which is good, especially in the first few miles, because I usually start running at a pace generously termed an ‘aggressive shuffle’.
      3. They also mean that I don’t have to run as far as I had previously planned the night before. In fact, if I’ve been running regularly for a while, and don’t feel like getting out of bed one morning, I won’t. I wait until I’m actually looking forward to running to start up again…
      4. That said, it’s best to get out and start going before I wake up and rational thought kicks in. The rational thing to do, 99% of the time, is to go back to bed.
5. While I love running to heavy metal, I stay away from the iPod, because on a dark and windy road at 6:30 in the morning, I need my ears and everything else fully functioning. So I run to an internal iPod instead: I get a good Ronnie James Dio tune going in my head and use that to kick it up a notch. We Rock, anyone?
      6. When I get tired I pretend I’m a Kenyan, i.e. I weigh next to nothing, have lungs the size of a small house, and just float over the ground. I’m still a shuffling Indian, but the visualization has gotten me up more steeps than I can remember.
      7. I cycle through my body as I run, relaxing tensed up shoulders, concentrating on good form, analyzing my foot strike, etc. It’s amazing what you can find to focus on for an hour and a half.
      8. I walk home the last 1/4 mile. You would too if you lived at the top of a steep hill. This gives my legs time to warm down, and makes them less sore.
      9. I then ride into work. This is also a good warm down for the legs, it’s about 5 miles in, with one hill of any consequence.
      10. I take the next day off of running. I ride, or climb, or do something else. This lets my 39 year old body recover in time for the next early morning session.
      11. I try to make these morning runs ‘adventures’, by going on new routes, exploring different areas, reversing parts of loops, etc. Nothing sucks worse than being unexcited about a run because you’ve done the same route for the last 6 months.

      This has kept me fresh and excited about running for a long time now. Hopefully I can stay relatively injury free (ankle is bugging me lately) and keep it together for a couple of 1/2 marathons later in the year.


      Garmin Developer Site — Arrgghh

      January 8, 2008

      I’ve been trying (after work, after the wife and kids go to bed, and before I collapse) to make some progress on my GPS site, and ran into a really strange problem running the demos, detailed here. There doesn’t seem to be a workaround, but one of the posts pointed me to the Communicator Plugin Test Page. While the Plugin API was pretty slick, it’s giving me no love, and I’m going to have to try the code referenced in GarminDeviceControlDemo.js when I’m not as beat.

      Speaking of being beat, I ignored the wise rule to never increase distance, time, or intensity by > 10% per week, and took my 8 mile run at 9:30 average pace to 10 miles at 9:08 average pace. Now the 8 mile is a lot hillier, but the 10 mile loop had some hills in it too…I was beat the next day, and my ankle is a little sore. Add to that the excessive eating I’ve been doing and I feel like a total slug. Hopefully a long run on Wednesday, combined with the iron will to stay away from the office candy stash, will get me back to feeling more alert.


      Does a language complicate your thinking?

      January 6, 2008

I was interviewing this kid the other day. I can rightfully call him a kid, since I’m at least 15 years older than him. But because calling someone ‘the Kid’ gets old, I’ll call him John Doe. He was straight out of school — up to this point we have been recruiting senior engineers, and seeing someone this green was a new experience for me. I wasn’t expecting much; most of my previous interviews with college kids had resulted in painful deer-in-the-headlights moments.

      Long story short, John Doe had the best instincts I have seen on anyone, including people with much more impressive resumes. As a natural talent, he has the geek equivalent of a 54 inch box jump, a 90 VO2 Max, a +5 ape index. And he didn’t have any semi autistic behaviors, i.e. Bill Gates Rocking — he actually seemed quite well adjusted.

      I gave him my standard coding question, which goes like this:

      Given one 20GB disk of unsorted ints, an empty 20 GB disk, and a 1 GB memory buffer, move the unsorted ints to the empty disk, in sorted order.

      I love this question because there are many good answers to it. There are also some quick observations to be made:

      (1) Standard in-place sorting algorithms don’t work due to the space limitation.

      (2) The buffer is best used as a sorting mechanism.

      (3) When using the buffer, you will need to handle the case where it fills up.

      (4) Extracting the data from the buffer is as important as inserting it.

      You would be surprised at how many people take a long time to get past (1). They then have to be prodded into (2), and when they code an insertion algorithm, they don’t deal with (3). And if they get to (4), it is only as an afterthought.

Mr Doe verbalized all 4 issues within the first 3 minutes, including the fact that any data structure he used would decrease the space efficiency of the algorithm because pointers take up room. He then proceeded to pick a binary tree, define its insertion efficiency, and code the traversal as an elegant 3-line Ruby solution. He then coded the insertion logic just as elegantly, catching and handling the case where the buffer’s maximum size is exceeded.
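
For reference, here is a rough Ruby sketch of one way the overall approach can go — this is my own sketch, not Mr Doe’s whiteboard code (it uses a plain sorted array per run rather than the binary tree he chose, and the file names and run size are made up): sort one buffer-full of ints at a time into sorted runs on the destination disk, then merge the runs.

# number of ints to sort per pass -- tune so a run fits in the 1GB buffer
RUN_SIZE = 100_000_000

# pass 1: read a buffer-full of ints, sort in memory, write out a sorted run
runs = []
File.open('unsorted.ints', 'rb') do |input|
  until input.eof?
    ints = input.read(4 * RUN_SIZE).unpack('l*').sort
    runs << "run_#{runs.size}.ints"
    File.open(runs.last, 'wb') { |f| f.write(ints.pack('l*')) }
  end
end

# pass 2: k-way merge of the sorted runs onto the destination disk
File.open('sorted.ints', 'wb') do |output|
  files = runs.map { |name| File.open(name, 'rb') }
  heads = files.map { |f| f.read(4).unpack('l').first }
  until files.empty?
    i = heads.index(heads.min)           # smallest current head wins
    output.write([heads[i]].pack('l'))
    if files[i].eof?                     # run exhausted, drop it
      files[i].close
      files.delete_at(i)
      heads.delete_at(i)
    else
      heads[i] = files[i].read(4).unpack('l').first
    end
  end
end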

      Even the guys that made it this far tended to fill the board up with Java. I was struck by the conciseness and readability of Mr Doe’s solution. I remember coding the same solution in Java and using much more of the whiteboard, and I began to wonder if John Doe was just that much better than me (totally possible), or if my exposure to C++ and Java had made me incapable of expressing elegant solutions. This is a thought that I have had a lot lately, since I got pulled back into some java development that involved Spring, and spent a not insignificant amount of time re-ramping up on how Spring config files work, and watching other developers flail their way through that same process. After I had finished re-introducing myself to Java Enterprise Development and learning enough about JPA to be extremely dangerous, I found myself reading the latest Yegge novella and nodding my head. Alex Vollmer says it much better than I ever could: when a language is by definition bloated and inconsistent, it becomes harder and harder to be elegant and concise.

I’m not entirely convinced that I can effectively blame a language for my over-complexity disorder (OCD), but there are two things I am fairly sure of: (1) John Doe is in, as far as I’m concerned — he is a rock star; and (2) I’m going to spend the next little while focusing on coding Ruby like a Rubyist would, i.e. taking advantage of its conciseness and readability, and hopefully producing something elegant.


      GPS Research — Javascript, not so bad!

      January 2, 2008

Part of my GPS research involves diving into the Garmin Communicator API, which is a JavaScript-based API built on Prototype. I’m beginning to realize that a lot of my frustrations with JS were based on my lack of exposure to dynamically typed languages, as well as on some seemingly more valid gripes, like the lack of a unit testing framework. Turns out I was wrong on both counts. My experience in Ruby has made me much more comfortable with dynamic typing, and jsunit is a reasonably mature unit testing framework that the Garmin team is using to unit test their js. It’s weird to me because I’ve specifically asked JS developers if there was a decent js unit testing framework out there, and their answer was ‘no’. Even guys who were not script kiddies, who actually wrote unit tests, scratched their heads on that one. I’m going to have to see if jsunit is all that, and if not, why not.

      I was able to get the examples up and running on my localhost site in no time flat, and that’s good, because I’m not as interested in mucking around with the details of device detection as I am in playing around with the data. Actually, I am getting a strange error when uploading data, but I want to poke through the code a little more to make sure it’s not pilot error, which it always seems to be 🙂

One of my New Year’s resolutions was to find something geeky outside of work to think about, and this fits the bill for the time being. The other resolutions were to spend more time with friends in fine locations, get off my ass and start kiteboarding, and introduce Kiran and Leela to the joys of snowsports and swimming so that we can eventually ride and surf together… sorry, I’m in a pensive 2K8 mood. I’ll update when I’ve got the data upload working and I can start geeking out with some correlation stuff around courses.

      —  Arun