Installing Wikipedia, part 2 of N: getting the !@?! data.

Continued from Part 1 of N

Where to start? First of all, the high-level steps:

  1. Download the dump file
  2. Convert it to a huge, uncompressed SQL file
  3. Import that file into MySQL

Downloading the Dump File

Potential (Dis)Qualifier: I’m doing this on a Debian box, assuming the standard tools available on that distro. You can find the (English) file dumps at http://download.wikimedia.org/enwiki/; the subdirectories correspond to the dates of the dumps. There are several compressed (bz2) XML files and several compressed (gzip) SQL files. Right now I don’t know the degree of overlap between the two, but I downloaded the latest pages-articles.xml.bz2, mainly because I didn’t want the extra user/discussion information found in pages-meta-current.xml.bz2.

Here are the steps I took to download and extract the dump that I then converted and loaded into MySQL:

wget http://download.wikimedia.org/enwiki/20080103/enwiki-20080103-pages-articles.xml.bz2

Then I ran

bzip2 -d enwiki-20080103-pages-articles.xml.bz2

to uncompress it into a huge (14GB) XML file.

How to Convert that Bad Boy into SQL

There are many ways to do this, detailed on this page. One thing I hate is 10 different ways to do the same thing, so I did some research and found that most people have had success using mwdumper. I downloaded the latest mwdumper.jar from http://download.wikimedia.org/tools/mwdumper.jar, and ran it as follows:

java -jar mwdumper.jar enwiki-20080103-pages-articles.xml --format=sql:1.5 > dump.sql

Note the order of the parameters. It is important to list the file you are dumping from before the format option; when I didn’t do this, I received a bizarre SQL insertion error:

ERROR 1064 (42000) at line 1: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ‘ …

Loading into MySQL

The MWDumper command line above can be used with a pipe to route data directly into MySQL. I tried this at first, but ran into the out-of-order parameter problem and backed off to creating a real SQL file before finding out about the parameter-order issue. I think splitting the XML->SQL conversion and the SQL insertion into separate steps is a good idea in general, because I haven’t gotten all of the SQL in cleanly yet, and having that file lets me make multiple passes.
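
For reference, the piped variant looks roughly like this. This is just a sketch: it assumes the wikipedia database and schema described below are already in place, and that username/password are your MySQL credentials.

# one-shot version: convert the XML and stream the SQL straight into MySQL
java -jar mwdumper.jar enwiki-20080103-pages-articles.xml --format=sql:1.5 \
  | mysql -u username -ppassword wikipedia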

There are a couple of changes to the MySQL config file (located at /etc/mysql/my.cnf on Debian) that improved the speed of the upload:

(1) Increase the size of the InnoDB log file and log buffer. The log file defaults to 5MB, which means lots of disk I/O. Here are the settings I used; note that these are conservative because I don’t have exclusive use of all 4GB of my machine’s RAM.

innodb_log_file_size=20M
innodb_log_buffer_size=8M

(2) Turn off log_bin. Logging transactions is a great way to make sure you don’t lose data, but I’m uploading to a single machine and don’t need the overhead. I comment out default settings as a general rule:

#log_bin = /var/log/mysql/mysql-bin.log
# WARNING: Using expire_logs_days without bin_log crashes the server! See README.Debian!
#expire_logs_days = 10
#max_binlog_size = 100M
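
For clarity, all of these settings live in the [mysqld] section of my.cnf. The relevant excerpt ends up looking roughly like this (a sketch; the stock Debian file has plenty of other settings I’ve left out):

[mysqld]
# bigger InnoDB log file and buffer = fewer flushes during the bulk load
innodb_log_file_size=20M
innodb_log_buffer_size=8M
# binary logging commented out for the duration of the import
#log_bin = /var/log/mysql/mysql-bin.log
#expire_logs_days = 10
#max_binlog_size = 100M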

After making these changes, I restarted mysql:

/etc/init.d/mysql restart
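
If you want a quick sanity check that the new values actually took effect, you can ask the server from the shell (optional):

mysql -u username -ppassword -e "SHOW VARIABLES LIKE 'innodb_log%'"
mysql -u username -ppassword -e "SHOW VARIABLES LIKE 'log_bin'"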

Then I created a UTF-8 encoded database to load into, using the mysql command line:

mysql -u username -ppassword (note that the password must immediately follow -p, with no space)

Once in the mysql command-line client:

CREATE DATABASE wikipedia CHARACTER SET utf8; (make sure you put the semicolon on the end)

Now leave the client (\q), because we need to load the database with a schema; otherwise the SQL load won’t work, since it’s just an insert script that assumes the tables already exist. Download the latest schema from

http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/maintenance/tables.sql?revision=29128

This is the head revision at the time of writing; visit the MediaWiki ViewVC page to get the latest version. This MediaWiki page provides a great overview of the schema.
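
If you’d rather grab the schema from the command line, something like the following should work. I believe &view=co is ViewVC’s raw-checkout view, but double-check in a browser that you get plain SQL rather than an HTML page; the revision number will also have moved on by the time you read this.

wget -O tables.sql 'http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/maintenance/tables.sql?revision=29128&view=co'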

You can insert the schema into the database by doing:

mysql -u username -ppassword wikipedia < tables.sql

Now, with the wikipedia database schema set up, you can insert data:

mysql -u username -ppassword wikipedia < dump.sql (assuming dump.sql is the name of the SQL file you generated using MWDumper)

Now it’s time to step away from the machine. I ran this job under nohup so I could get on with my life. It’s still running, and I’ll update the blog with the results in my next Wikipedia-related post.
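
For the record, a minimal sketch of the nohup’d import, with output captured to a log so errors can be inspected later:

nohup mysql -u username -ppassword wikipedia < dump.sql > import.log 2>&1 &

# optional: peek at progress by counting rows in the page table as they arrive
mysql -u username -ppassword wikipedia -e "SELECT COUNT(*) FROM page"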
