One of my mandates at work is to build up a huge repository of information taken from free and not free, structured and unstructured data sources. Naturally we targeted Wikipedia as the best seed source to get us started. After writing some very specialized table scrapers in Ruby, we decided to take another approach and just ingest the entire Wikipedia database.
Wikipedia has a “don’t scrape our pages, just download our file dumps” policy, which makes complete sense. Because of this (and the IP blacklisting that occurs when you scrape too fast or too much), our goal is to harvest information from an internal Wikipedia mirror.
Wikipedia provides a set of tools and instructions for installing a local copy of Wikipedia, starting with processing the dump into MySQL and then installing and running MediaWiki, the software that converts the contents of the database into displayable and editable wiki content.
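In outline, the dump-to-MySQL step looks something like the sketch below. The file name, database name, and credentials are assumptions, and the guard lets the script run harmlessly on a machine where the tools are not installed:

```shell
# Sketch of the dump-to-MySQL step, assuming mwdumper.jar and a local
# MySQL database named 'wikidb'; file and database names are placeholders.
DUMP=pages_articles.xml.bz2
DB=wikidb

if command -v java >/dev/null 2>&1 && [ -f mwdumper.jar ] && [ -f "$DUMP" ]; then
  # mwdumper converts the XML page dump into SQL; mysql loads it.
  java -jar mwdumper.jar --format=sql:1.5 "$DUMP" | mysql -u root -p "$DB"
else
  echo "mwdumper.jar or $DUMP not present; skipping import"
fi
```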
That is the theory, at least. Here is what I’ve found while trying to follow these instructions. Note that some, if not most, of the drama below is self-manufactured, but hopefully someone, somewhere will find this information useful.
- There is too much information in some places — e.g. six different ways to do the same thing — and not enough in others — e.g. what an actual XML dump contains vs. the SQL dumps.
- As with most batch processes, error handling and recovery are painful. I’m going to share my lessons learned with the Wikipedia tools in a future post.
- There is no mapping of SQL to XML files in the download directories. This may only be a problem because I was unable to successfully translate all of the XML into SQL, but it would still be nice to see that pages.sql contains text and revisions (because it’s not obvious that the other SQL files contain any of that information).
- There are plenty of ways to go off into the weeds. For example, I just spent half an hour debugging why my SQL import failed at 68K entries. It turns out that mwdumper’s progress output was getting folded into the SQL file because I was running the job under nohup. For now I’m going to run it as a console job, and investigate how to separate stdout from stderr when invoking the import under nohup.
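One way to keep the two streams apart under nohup is to redirect them explicitly, so there is nothing left for nohup to fold together. A sketch, with `echo` standing in for the real mwdumper invocation:

```shell
# Redirect stdout (the SQL) and stderr (progress messages) to separate
# files, so progress output never lands in the dump file.
# 'echo' is a stand-in here; the real command would look like:
#   nohup java -jar mwdumper.jar --format=sql:1.5 dump.xml.bz2 \
#     > pages.sql 2> mwdumper.log &
nohup sh -c 'echo "INSERT INTO page VALUES (1);"' > pages.sql 2> mwdumper.log &
wait
```

With this split, `pages.sql` stays clean SQL and any progress or error chatter ends up in `mwdumper.log` instead.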
In addition, I’m also using ActiveRecord to access and manipulate the data once it (eventually) gets into the database. ActiveRecord without Rails merits some discussion, primarily because I’m also using it with RSpec, which I believe is a much more natural way to validate code.
I’m going to update my daily progress here so that I don’t have to go through this again. Stay tuned!