Installing Wikipedia part 4 of N: getting additional Wikipedia metadata

Continued from yesterday:

The loading of the ‘page’ table data finished after approximately 16 hours on a 2.2 GHz dual-processor, 4 GB machine, yielding approximately 6.5 million page records, along with the latest revision and text information (a similar number of records). All other tables were blank, which is fine if you just want to host Wikipedia, but not fine if you want to gather inter-page links and category metadata, which are stored in the pagelinks and categorylinks tables respectively.

Page link data is useful because it provides a basic graph of all Wikipedia pages. Category link data is useful because it provides decent classification information without the overhead of classification methods that rely on raw text.

Where to get the metadata:

Pagelink data is available at

http://download.wikimedia.org/enwiki/20080103/enwiki-20080103-pagelinks.sql.gz

Categorylink data is available at

http://download.wikimedia.org/enwiki/20080103/enwiki-20080103-categorylinks.sql.gz

    I used wget and gzip -d to get the raw SQL data.
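
    For reference, the fetch-and-unpack step was just something like this (file names taken from the URLs above):

    wget http://download.wikimedia.org/enwiki/20080103/enwiki-20080103-pagelinks.sql.gz
    wget http://download.wikimedia.org/enwiki/20080103/enwiki-20080103-categorylinks.sql.gz
    gzip -d enwiki-20080103-pagelinks.sql.gz
    gzip -d enwiki-20080103-categorylinks.sql.gz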

    How to load the metadata:

    • mysql -u arun -parun wikipedia < enwiki-20080103-categorylinks.sql
    • mysql -u arun -parun wikipedia < enwiki-20080103-pagelinks.sql

    I ran into a problem when trying to load the categorylinks SQL: I received the following MySQL error:

    ERROR 1071 (42000) at line 12: Specified key was too long; max key length is 1000 bytes

    I googled around and found a thread that said the error was happening because the database was UTF-8, and that the fix was to switch it to Latin-1. (With UTF-8, MySQL reserves up to 3 bytes per character when sizing index keys, so the wide varchar keys in categorylinks blow past the 1000-byte MyISAM key limit; Latin-1 needs only 1 byte per character.) When I ran this from the mysql command-line client:

    alter database character set latin1;

    and reloaded the categorylinks SQL, it worked. Moral of the story: create your Wikipedia database with Latin-1 encoding. If you are going to insert into a UTF-8 database, you will need to convert from Latin-1 to UTF-8 first. I’m using Ruby, so I’m going to use Iconv to convert to UTF-8 prior to inserting into my UTF-8 database.
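
    Concretely, the moral amounts to something like this if you are creating the database from scratch (same database name as in the load commands above):

    create database wikipedia default character set latin1;

    (If you do stick with a UTF-8 database instead, the command-line iconv tool can do the same Latin-1-to-UTF-8 conversion that the Ruby Iconv step will handle.)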

    One thing that happened when I tried to confirm the number of loaded category/page links in the db was that I received a strange internal MySQL error. I tried to restart the machine and got a ‘/var/lib/mysql: partition too full!’ message, and the database wouldn’t start up. I fixed this by deleting some data. If you can’t delete any data, try the steps suggested here.

    When I was able to restart the database, the pagelinks table was corrupted. I ran

    repair table pagelinks;

    from the mysql command-line client, as described in the MySQL documentation. The repair took about 3 hours, but the table is now back in working order.
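
    If you want to confirm the table’s state before kicking off (or after finishing) a repair, a plain check should do it:

    check table pagelinks;

    It reports the status in the Msg_text column; ‘OK’ means the table is fine.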

    Wikipedia Download Summary:

    • Total Page Count: 6202531
    • Total Pagelink Count: 77444718
    • Total CategoryLink Count: 18912664
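
    (For the record, counts like these are just straightforward count queries against each table, e.g.:

    select count(*) from page;
    select count(*) from pagelinks;
    select count(*) from categorylinks;
    )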

    8 Responses to Installing Wikipedia part 4 of N: getting additional Wikipedia metadata

    1. indigene says:

      Arun

      Read with great interest your travails to get the local Wikipedia up and running. Phew!

      I am building a web application using the MediaWiki platform on LAMP. Currently doing the DB design with MySQL Workbench.

      I have imported the MediaWiki schema you mention in your article:

      http://www.mediawiki.org/wiki/Manual:Database_layout

      using this schema xml file:
      http://files.nickj.org/MediaWiki/mediawiki-dbdesigner-schema-data.xml

      which also generated a cute layout pic.

      But I would love to look at an actual layout of the Wikipedia db. Can you generate a db schema file from your local Wikipedia and send it to me in a format (SQL/XML) that I can feed into Workbench?

      I am indigene2007 on gmail

      Thanks in advance,
      Warm Regards

    2. geoffsyndicate says:

      Hi,

      How long did it take to load the category links and page links? The category links script locks the table, so I don’t know of any way to tell how far through it is. It’s been going for a while now…

    3. arunxjacob says:

      It didn’t take more than an hour. What kind of hardware are you running on? I was running this on a quad-core 64-bit server.

    4. geoffsyndicate says:

      I’m on a MacBook Pro, 2.4 GHz with 4 gigs of RAM. I’m thinking about cancelling it and removing the locks from the SQL so I can monitor its progress. Thoughts?

    5. geoffsyndicate says:

      Although I found this series of posts to be very useful for first-time Wikipedia importers (thank you for posting this BTW), I eventually gave up on MySQL. It turns out that MySQL was crashing pretty severely when I tried to import the categorylinks data, so I have written a conversion script and imported the whole lot into PostgreSQL (without any problems).

      There you go. MySQL == bad, PostgreSQL == good.

    6. arunxjacob says:

      That’s strange; we actually had issues getting it to run under Postgres. I wasn’t the guy trying to do this, but he was bummed he couldn’t get it working on ‘a real database instead of that toy MySQL database!’ (his words, not mine; I’m not religious about this stuff 🙂).

    7. wantondevious says:

      Hi,

      I hadn’t found this before I started this project myself. However, the approach I took was to install MediaWiki, which in turn creates the database (plus allows you to render the dump). Currently I’m at 4.7 million pages after 1 day, and I’m cursing that I didn’t turn off binary logging and DISABLE the page and text indexes.

      I’m not sure whether this would make it faster, but turning off indexes usually does. In general the binary log is OK in MySQL (the docs claim only about 1% overhead), but we’re loading 12-13 GB of text, so it creates 12-13 GB worth of logs.
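
      Concretely, I mean something along these lines, run in the same session that does the import (untested; DISABLE KEYS only skips non-unique MyISAM indexes, and SET sql_log_bin needs the SUPER privilege):

      set sql_log_bin = 0;
      alter table page disable keys;
      alter table text disable keys;
      -- ... load the dump ...
      alter table page enable keys;
      alter table text enable keys;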

      My question is: what does the pagelinks table actually DO? MediaWiki seems to render everything just fine without this extraction. How would it generate this table in normal usage? Or is it an ad hoc table that’s only populated on demand?

      I’d like to have this data, but if MediaWiki will generate it for me, it seems silly to upload the pagelinks table.

    8. arunxjacob says:

      I was more interested in the raw schema than the rendered contents, which is why I didn’t go the MediaWiki route. Pagelinks shows the inter-page links; how many pages link to a given page correlates with that page’s relevance.
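
      For example, a quick-and-dirty relevance proxy is just the inbound link count per target page, something like this (column names as in the stock MediaWiki pagelinks schema):

      select pl_namespace, pl_title, count(*) as inbound_links
      from pagelinks
      group by pl_namespace, pl_title
      order by inbound_links desc
      limit 20;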
