Continued from Part 1 of N
Where to start? First of all, the high-level steps:
- Download the dump file
- Convert it to a huge, uncompressed SQL file
- Import that file into MySQL
Downloading the Dump File
Potential (Dis)Qualifier: I'm doing this on a Debian box, assuming the standard tools available on that distro. You can find the (English) file dumps at http://download.wikimedia.org/enwiki/; the subdirectories correspond to the dates of the dumps. There are several compressed (bz2) XML files and several compressed (gzip) SQL files. Right now I don't know the degree of overlap between the two, but I downloaded the latest pages-articles.xml.bz2, mainly because I didn't want the extra user/discussion information found in pages-meta-current.xml.bz2.
Here are the steps I took to download and extract the SQL that I then loaded into MySQL:
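The download itself is just a straight fetch of the bz2 file. Assuming the 20080103 dump (swap in whichever date directory you picked), something like:

wget http://download.wikimedia.org/enwiki/20080103/enwiki-20080103-pages-articles.xml.bz2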
Then I ran
bzip2 -d enwiki-20080103-pages-articles.xml.bz2
to uncompress it to a huge (14GB) XML file.
How to Convert that Bad Boy into SQL
There are many ways to do this, detailed on this page. One thing I hate is 10 different ways to do the same thing, so I did some research and found that most people have had success using mwdumper. I downloaded the latest mwdumper.jar from http://download.wikimedia.org/tools/mwdumper.jar, and ran it as follows:
java -jar mwdumper.jar enwiki-20080103-pages-articles.xml --format=sql:1.5 > dump.sql
Note the order of the parameters. It is important to list the file you are dumping from before the format; when I didn't do this, I received a bizarre SQL insertion error:
ERROR 1064 (42000) at line 1: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ‘ …
Loading into MySQL
The MWDumper command line above can be piped directly into MySQL. I tried this at first, but ran into the out-of-order parameter problem and fell back to creating a real SQL file before I figured out the parameter order issue. I think splitting the XML-to-SQL conversion and the SQL insertion into separate steps is a good idea in general, because I haven't gotten all of the SQL in cleanly yet, and having that file around lets me make multiple passes.
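For reference, the piped version would look something like this (username, password, and the wikipedia database are the same placeholders used below, and the database and schema still need to exist before you pipe into them):

java -jar mwdumper.jar enwiki-20080103-pages-articles.xml --format=sql:1.5 | mysql -u username -ppassword wikipedia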
There are a couple of changes to the MySQL config file (/etc/mysql/my.cnf) that improved the speed of the upload:
(1) Increase the size of the InnoDB log file and log buffer. The log file defaults to 5MB, which means lots of disk I/O. The settings I used were conservative, because I don't have exclusive use of all 4GB of my machine's RAM.
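As a sketch (these numbers are illustrative rather than a copy of my config, so size them to your own RAM), the relevant lines in the [mysqld] section look like:

innodb_log_file_size = 256M
innodb_log_buffer_size = 8M

One caveat: if you change innodb_log_file_size, MySQL won't come back up cleanly until the old ib_logfile* files have been moved aside after a clean shutdown.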
(2) Turn off log_bin. Logging transactions is a great way to make sure you don't lose data, but I'm uploading to a single machine and don't need the overhead. I comment out default settings as a general rule:
#log_bin = /var/log/mysql/mysql-bin.log
# WARNING: Using expire_logs_days without bin_log crashes the server! See README.Debian!
#expire_logs_days = 10
#max_binlog_size = 100M
After making these changes, I restarted MySQL.
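On a Debian box that's typically:

sudo /etc/init.d/mysql restart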
Then I created a UTF-8 encoded database to import into, using the mysql command line:
mysql -u username -ppassword (note that -p only takes a password if you don’t leave a space)
Once in the mysql command-line client:
create database wikipedia CHARACTER SET utf8; (make sure you put the semicolon on the end)
Now leave the client (\q), because we need to load the database with a schema first; otherwise the SQL load won't work, since the dump is just an insert script that assumes the tables already exist. Download the latest schema (tables.sql) from the MediaWiki source.
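If you already have a MediaWiki release unpacked somewhere, the schema lives at maintenance/tables.sql inside it; for example (the version number here is just an example):

cp mediawiki-1.11.0/maintenance/tables.sql tables.sql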
You can insert the schema into the database by doing:
mysql -u username -ppassword wikipedia < tables.sql
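Before kicking off the big import, it's worth a quick sanity check that the tables actually exist, e.g.:

mysql -u username -ppassword wikipedia -e "show tables;"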
Now, with the wikipedia database schema set up, you can insert data:
mysql -u username -ppassword wikipedia < dump.sql (assuming dump.sql is the name of the sql file you generated using MWDumper)
Now it's time to step away from the machine. I ran this job under nohup so I could get on with my life. It's still running, and I'll update the blog with the results in my next Wikipedia-related post.
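For the record, the nohup invocation was nothing special; something along these lines, with import.log standing in for wherever you want the output to go:

nohup mysql -u username -ppassword wikipedia < dump.sql > import.log 2>&1 &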