Continued from yesterday:
Loading the ‘page’ table data finished after approx. 16 hours on a dual-processor 2.2GHz machine with 4GB of RAM. That gave me approx. 6.5 million page records, along with the latest revision and text information (a similar number of records). All other tables were empty, which is fine if you want to host Wikipedia, but not if you want to gather inter-wiki page links and category metadata, which are stored in the pagelinks and categorylinks tables respectively.
Page link data is useful because it provides a basic graph of all Wikipedia nodes. Category link data is useful because it provides decent classification information without the overhead of classification methods that rely on raw text.
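To make the "basic graph" idea concrete, here is a minimal sketch of turning pagelinks rows into an adjacency list. The sample rows are made up for illustration; they only mimic the shape of the pagelinks columns (pl_from, pl_namespace, pl_title), and the helper name build_link_graph is my own.

```ruby
# Build an adjacency list: source page id => array of linked-to titles.
# Each row is [pl_from, pl_namespace, pl_title], mirroring the pagelinks columns.
def build_link_graph(rows)
  graph = Hash.new { |hash, key| hash[key] = [] }
  rows.each do |pl_from, _pl_namespace, pl_title|
    graph[pl_from] << pl_title
  end
  graph
end

# Hypothetical sample rows, not real dump data.
sample_rows = [
  [1, 0, "Ruby"],
  [1, 0, "MySQL"],
  [2, 0, "Ruby"],
]

graph = build_link_graph(sample_rows)
graph.each { |from, titles| puts "#{from} -> #{titles.join(', ')}" }
```

Note that the source side of a link is a page id while the target side is a title, so getting an id-to-id graph would presumably need a lookup against the page table as well.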
Where to get the metadata:
Pagelink data is available at
http://download.wikimedia.org/enwiki/20080103/enwiki-20080103-pagelinks.sql.gz
Categorylink data is available at
http://download.wikimedia.org/enwiki/20080103/enwiki-20080103-categorylinks.sql.gz
I used wget and gzip -d to get the raw SQL data.
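If you want to script the download step, something like the following sketch could generate the wget and gzip -d commands per table. The URL pattern is inferred from the two dump links above rather than from any official spec, and the helper names dump_url and fetch_commands are hypothetical.

```ruby
# Build the download URL for one table of an enwiki dump.
# The pattern is an assumption inferred from the two links given above.
def dump_url(table, date = "20080103")
  "http://download.wikimedia.org/enwiki/#{date}/enwiki-#{date}-#{table}.sql.gz"
end

# Return the shell commands that fetch and decompress one table's dump.
def fetch_commands(table, date = "20080103")
  url = dump_url(table, date)
  archive = File.basename(url)
  ["wget #{url}", "gzip -d #{archive}"]
end

commands = %w[pagelinks categorylinks].flat_map { |table| fetch_commands(table) }
puts commands
```

This only prints the commands; running them is left to the shell so that nothing gets downloaded by accident.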
How to load the metadata:
- mysql -u arun -parun wikipedia < enwiki-20080103-categorylinks.sql
- mysql -u arun -parun wikipedia < enwiki-20080103-pagelinks.sql
I ran into a problem when trying to load the categorylinks SQL. MySQL reported the following error:
ERROR 1071 (42000) at line 12: Specified key was too long; max key length is 1000 bytes
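I googled around and found a thread that said the error was happening because the database was UTF-8, and that the fix was to switch it to Latin-1. This makes sense: with UTF-8, MySQL budgets up to three bytes per character when sizing an index, so a key over long text columns can exceed the 1000-byte limit, while Latin-1 needs only one byte per character. When I ran this from the mysql command-line client: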
alter database character set latin1;
and reloaded the categorylinks SQL, it worked. Moral of the story: create your Wikipedia database with Latin-1 encoding. If you are going to insert the data into a UTF-8 database, you will need to convert it from Latin-1 to UTF-8. I’m using Ruby, so I’m going to use Iconv to convert into UTF-8 prior to inserting into my UTF-8 database.
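As a rough sketch of that conversion step: Iconv was removed from the Ruby standard library in 2.0, so this example swaps in String#encode, which does the same job on newer Rubies. The helper name latin1_to_utf8 and the sample bytes are my own. It also assumes the stored bytes really are Latin-1; if they are already UTF-8, transcoding would garble them, so it is worth checking a known non-ASCII title first.

```ruby
# Treat the raw bytes as Latin-1 (ISO-8859-1), then transcode them to UTF-8.
def latin1_to_utf8(raw)
  raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# Hypothetical value as it might come out of a latin1 column:
# "Caf" followed by byte 0xE9, which is e-acute in Latin-1.
raw_title = "Caf\xE9".b
utf8_title = latin1_to_utf8(raw_title)

puts utf8_title.encoding
puts utf8_title.bytes.inspect
```

In UTF-8 the single byte 0xE9 becomes the two bytes 0xC3 0xA9, which is why the byte length grows by one.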
When I tried to confirm the number of loaded category and page links in the db, I received a strange internal MySQL error. I tried restarting the machine and got a ‘/var/lib/mysql: partition too full!’ message (and the database wouldn’t start up). I fixed this by deleting some data. If you can’t delete any data, try the steps suggested here.
When I was able to restart the database, the pagelinks table was corrupted. I ran
repair table pagelinks;
from the mysql command-line client, as described in the MySQL documentation. The repair took about 3 hours to complete.
Wikipedia Download Summary:
- Total Page Count: 6202531
- Total Pagelink Count: 77444718
- Total CategoryLink Count: 18912664