Installing Wikipedia, part 2 of N: getting the !@?! data.

Continued from Part 1 of N

Where to start… first of all, the high-level steps.

  1. Download the dump file
  2. Convert it to a huge, uncompressed SQL file
  3. Import that file into MySQL

Downloading the Dump File

Potential (Dis)Qualifier: I’m doing this on a Debian box, assuming the standard tools available on that distro. You can find the (English) file dumps on the Wikimedia downloads site; the subdirectories correspond to the dates of the dumps. There are several compressed (bz2) XML files and several compressed (gzip) SQL files. Right now, I don’t know the degree of overlap between the two, but I downloaded the latest pages-articles.xml.bz2, mainly because I didn’t want the extra user/discussion information found in pages-meta-current.xml.bz2.

Here are the steps I took to download the dump and produce the SQL that I then loaded into MySQL:
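First, fetch the compressed dump. Something like the following wget should do it; the URL assumes the standard date-directory layout on the Wikimedia downloads site, so adjust it for the dump you want:

wget http://download.wikimedia.org/enwiki/20080103/enwiki-20080103-pages-articles.xml.bz2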


Then I ran

bzip2 -d enwiki-20080103-pages-articles.xml.bz2

to uncompress it to a huge (14GB) XML file.

How to Convert that Bad Boy into SQL

There are many ways to do this, detailed on the MediaWiki data dumps page. One thing I hate is 10 different ways to do the same thing, so I did some research and found that most people have had success using mwdumper. I downloaded the latest mwdumper.jar from the MediaWiki tools download area and ran it as follows:

java -jar mwdumper.jar enwiki-20080103-pages-articles.xml --format=sql:1.5 > dump.sql

Note the order of the parameters. It is important to list the file you are dumping from prior to the format; when I didn’t do this, I received a bizarre SQL insertion error:

ERROR 1064 (42000) at line 1: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ‘ …

Loading into MySQL

The MWDumper execution line above can be used with a pipe to route data directly into MySQL. I tried this at first, but ran into the out-of-order parameter problem, and backed off to creating a real SQL file before finding out about the parameter-order issue. I think breaking XML->SQL conversion and SQL insertion into separate steps is a good idea in general, because I haven’t gotten all of the SQL in clean yet, and having that file enables me to make multiple passes.
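For reference, the piped variant looks something like this; username and password are placeholders, and the wikipedia database has to exist first (we create it below):

java -jar mwdumper.jar enwiki-20080103-pages-articles.xml --format=sql:1.5 | mysql -u username -ppassword wikipedia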

There are a couple of changes to the MySQL config file, located at /etc/mysql/my.cnf on Debian, that improved the speed of the upload:

(1) Increase the size of the InnoDB log file and log buffer; the log file defaults to 5MB, and that small size means lots of disk I/O. Here are the settings I used; note that these were conservative because I don’t have exclusive use of all 4GB of my machine’s RAM.
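Something like the following, in the [mysqld] section; the exact numbers here are illustrative, so size them to the RAM you can spare:

innodb_log_file_size = 128M
innodb_log_buffer_size = 8M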


(2) Turn off log_bin — logging transactions is a great way to make sure you don’t lose data, but I’m uploading to a single machine and don’t need the overhead. I comment out default settings as a general rule:

#log_bin = /var/log/mysql/mysql-bin.log
# WARNING: Using expire_logs_days without bin_log crashes the server! See README.Debian!
#expire_logs_days = 10
#max_binlog_size = 100M

After making these changes, I restarted mysql:

/etc/init.d/mysql restart
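One caveat: if you change innodb_log_file_size on an existing install, InnoDB may refuse to come back up against the old, differently sized log files. If the restart fails with a log-file-size error, stop MySQL cleanly, move the old logs aside, and start it again; the paths below assume Debian’s default data directory:

/etc/init.d/mysql stop
mv /var/lib/mysql/ib_logfile0 /var/lib/mysql/ib_logfile1 /tmp/
/etc/init.d/mysql start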

Then I created a UTF-8 encoded database to load the data into, using the mysql command line:

mysql -u username -ppassword (note that -p only takes the password if you don’t leave a space after it)

Once in the mysql commandline client:

CREATE DATABASE wikipedia CHARACTER SET utf8; (make sure you put the semicolon on the end)
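Equivalently, you can create the database non-interactively with mysql’s -e flag:

mysql -u username -ppassword -e "CREATE DATABASE wikipedia CHARACTER SET utf8;"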

Now leave the client (\q), because we need to load the database with a schema first; otherwise the SQL load won’t work — it’s just a table insert script that assumes the tables already exist. Download the latest schema (maintenance/tables.sql) from the MediaWiki source repository; I grabbed the head revision at the time of writing, so check the MediaWiki source browser for the latest version. The MediaWiki site also provides a great overview of the schema.

You can load the schema into the database by doing:

mysql -u username -ppassword wikipedia < tables.sql
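It’s worth a quick sanity check that the tables actually made it in; something like:

mysql -u username -ppassword wikipedia -e "SHOW TABLES;"

should list page, revision, text, and friends.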

Now, with the wikipedia database schema set up, you can insert data:

mysql -u username -ppassword wikipedia < dump.sql (assuming dump.sql is the name of the sql file you generated using MWDumper)

Now it’s time to step away from the machine — I ran this job under nohup so I could get on with my life. It’s still running, and I’ll update the blog with the results in my next Wikipedia-related post.
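For completeness, the nohup invocation looks something like this (the log file name is arbitrary):

nohup mysql -u username -ppassword wikipedia < dump.sql > import.log 2>&1 &

The trailing & backgrounds the job, nohup keeps it alive after logout, and tail -f import.log shows progress and any errors.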

