Powdercattin’ it

March 17, 2008

Some of the best memories I have of the 1990s are of ripping sidecountry runs with my friends at Stevens from the first chair to the last, pulling 9AM to 10PM days with a brief 5PM stop to dry off and power-nap.

We drifted apart over the last couple of years, due to lifestyle changes — jobs, infants, moves, etc. But we got back together this last weekend for a day of guided riding with Cascade Powder Cats. This late in the season, I didn’t exactly get my hopes up for epic conditions, but I was definitely looking forward to spending time with the old crew (and by old, I mean we’re all circling 40 and quickly moving north).

In fact, as the week started, temperatures were high and rain was plentiful, lowering my expectations of the conditions to basic guanch — aka Cascade Concrete — layered over random death cookies. But Tuesday night, temps dropped and the rain stepped up. I willfully ignored the snow reports, because (a) I wanted to follow the Law of Low Expectations, and (b) we were going to be guided, and so keeping up on snowfall, temp ranges, wind load, etc was effectively outsourced.

We pulled in late Thursday night to the pornographically named Mysty Mountain Cabin (I mean the only time I’ve ever seen anyone named Mysty was set to the backbeat of a lame disco soundtrack, and involved a Pizza Delivery Man) nestled just north of the damp and moss encrusted idyll of Skykomish. The next morning we woke up and rolled up to the headquarters of CPC, just off to the side of highway 2. We loaded up into one of the cats and trundled off around the ridge.

9 miles later we jumped out, underwent a brief recap of avy training using digital transceivers, and jumped back into the cat for a ride to the top.

One potentially limiting factor: I hadn’t ridden in 6 years. I have no idea how that happened, but there you have it. So as we unloaded, strapped in, and got some instructions from the guide on where to go, I was kind of wondering if I could even turn anymore. As usual, rational thought was overridden by the potential for fun. Fortunately, riding powder — especially light, fluffy powder — is like riding a bike. It all came back after the first three turns and there I was, with the rest of the guys, hooting and hollering like a complete idiot :)

me_and_jay.jpeg

Cat riding rules. The guys at CPC were awesome. Harlan, the lead guide, was super cool, setting us up epic after epic after epic… Jeff, the sweep guide and one of the owners, was as excited as we were to be out in such killer conditions. The cat driver (I forget his name) was a total stud; I had no idea cats were so maneuverable. At lunch they took us into the yurt for some sandwiches, tomato soup, and Oreos. Any gourmands might be snorting in disgust right now, but that’s exactly the fare to refuel with in the middle of a big day. Fast and good eating, minimal down time. Back in the saddle.

ridin_high

This is what we rode: wide open bowls followed by perfectly spaced glades littered with super fun rollers that started with an ollie and ended with a pillow soft touchdown. Had we been younger, bolder, and riding more, they would have pointed us to cliff drops and windlips, but we were more than satisfied with the ride quality and selection.

where_my_teeth.jpeg

By late afternoon I was getting pretty beat, but only felt it on the ride up. Once we strapped in for the next one all pain was forgotten — sure, the reflexes were getting a little slower and the legs had a little less pop, but the spirit was still willing. Even though we had left the Vitamin I at home.

The last run was a hero run; the boys at CPC were obviously catering to our aging legs and our youthful egos. It started with a traverse into a bowl so wide open that we were all gripped with ‘Point it or Slash it ?!?’ syndrome. I opted for GS turns, keeping the speed while feeling the g’s of a nice arcing powder turn, driving my back foot down and letting the front ride free. The snow was still light, and fast, and the angle was relaxed enough so that we could enjoy it without maxing our battered legs. It was the perfect way to end the day.

The next day, after passing out promptly at 10PM (so much for traditional bachelor party hijinks :) ), we rallied for a half day at Stevens, our old stomping grounds. Of course we all talked it down, claiming tiredness and expecting chopped up conditions. While inbounds was quite chopped up, requiring jump turns, focus, and legs I didn’t quite have, time had stood still in all of the powder stashes, and there was still incredible light pow to be poached just out of bounds on skier’s left past Southern Cross. We were 15 years older, but riding the same glades, ripping the same lines, smiling the same stupid grins, and feeling the same stoke.

old_skool_posse.jpeg

So now I’m heading back into a high pressure week at work, but feeling so much better than I would have sans 2 epic powder days. I think I’ve lost my way in the last couple of years. In the middle of having a family and getting real about work, I’ve forgotten some of the basic essentials of good living: making time to have great adventures outside with good friends that get as stoked as I do, whether we’re riding powder, or climbing, or doing a killer ride. It’s a little late for a New Year’s resolution, but here I go. 2008, despite work and family and a busy life, needs to be balanced, sprinkled with days like the last couple. One of the guys said it much better than I could ever have: he turned to me in the cat after yet another epic ride and said “I mean it’s not like I’m going to be on my deathbed saying ‘I could’ve worked more, or finished that project earlier’. We’ve got to get back on this while we’ve got cash, time and legs.”


Data Liberation via the Script Tag and wrapped JSON

March 3, 2008

Mashups excite me because they allow a user to extract personal unique value from applications that were not designed to do so — at the risk of butchering a metaphor, it’s a recombinant effect — mashups leverage common protocols and allow people to mix data in ways that the original data providers couldn’t have imagined.

But this kind of data sampling is not as easy as pulling data in from specific sites, because cross domain scripting limitations prevent JavaScript loaded from one domain from requesting data from another domain. To get around this limitation, you need to proxy the call to the other domain from your own server. This limits mashup creation because I can’t just drop in a call to my favorite data provider on my blog or my homepage. Kind of a downer.

Fortunately there has been a way around the cross domain limitation for a while — here is my understanding of how it works.

The Script Tag

While JavaScript is constrained by the same origin policy, the src attribute of the script tag is not, so you can load JavaScript from another domain with no ill effect: a script tag whose src points at the other website is all it takes. Note that if you load JavaScript via a script tag, you need to actually use that JavaScript in a subsequent script tag. So we can access JavaScript from another domain. Now how do we actually use it?

Script Tag rendering

A script tag executes the script referenced by its src attribute when it is rendered by the browser. If we return JSON from the src URL of a script tag, the browser evaluates it immediately. But we don’t actually have a way to get hold of the result. The tag below:

<script src="http://arun.com/js/someJavaScript.js"></script>

may map to

{{data}}

<script>

eval({{data}})

</script>

but the result is not assigned to anything, so the rest of the JavaScript on the page has no way to reach the data.

Wrapping my JSON

The final piece in the puzzle is the use of callbacks — by wrapping JSON in a user specified callback, we can return data from an outside domain to a user defined function.

The user can specify the callback as a parameter to a method defined in the remotely included js:

<script src="http://otherdomain.com/js/some.js"></script>

where some.js has that method defined:

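function returnRemoteData(options) {
  // options.callback is the NAME of a global function, passed as a string,
  // so that it can go into the query string of the remote request.
  var callback = options.callback;
  var head = document.getElementsByTagName("head")[0];
  var newScript = document.createElement('script');
  newScript.id = 'remoteAccessScript';
  newScript.src = "http://remotescript.com/js/remote.js?callback=" + encodeURIComponent(callback);
  // Appending the tag makes the browser fetch and execute the remote script.
  head.appendChild(newScript);
}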

and the user calls it as follows:

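<script>
function execRemote(data) {
  // data is already a JavaScript object here: the remote response is
  // execRemote({...}), so there is no need to eval it again.
  ...
}

// Pass the callback by name so it can be sent to the remote server.
returnRemoteData({callback: 'execRemote'});
</script>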

returnRemoteData will cause a new script tag to be created that requests source from http://remotescript.com/js/remote.js, passing the user specified callback method as a parameter. This is important because the return from that call will be:

execRemote({{JSON data}}), which will get executed by the script tag.
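For completeness, here is a minimal sketch of what the remote side of this exchange might look like, written in Ruby. The function name wrap_json and the sample data are made up for illustration, and the callback name check is my own assumption about what a careful provider would do; it is not taken from any particular API.

```ruby
require 'json'

# Hypothetical server-side helper: serialize a Ruby object as JSON and wrap
# it in a call to the callback named by the client.
def wrap_json(callback, data)
  # Only accept plain identifiers, so the callback parameter cannot be used
  # to inject arbitrary script into the response.
  unless callback.to_s.match?(/\A[A-Za-z_$][A-Za-z0-9_$]*\z/)
    raise ArgumentError, "invalid callback name: #{callback.inspect}"
  end
  "#{callback}(#{JSON.generate(data)})"
end

# Made-up sample data, just to show the shape of the response body.
puts wrap_json('execRemote', { 'runs' => 3, 'resort' => 'Stevens' })
```

The string this produces is exactly the kind of response the dynamically created script tag would execute, which in turn calls the user's function with the data as its argument.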

If I were writing an API, a la Yahoo/Google, I would keep the user from having to know about the new script tag, etc, by wrapping it in my code. This makes it super easy for them to access the data I provide by including my js files in their page, and calling my functions with their callbacks.



Garmin TCX to KML: the Prelude (splitting my huge exported exercise file)

February 19, 2008

TCX is the Garmin proprietary file format that logs exercise information. Here is a snippet:

<Activity Sport="Running">
<Id>2008-01-26T18:29:26Z</Id>
<Lap StartTime="2008-01-26T18:29:26Z">
<TotalTimeSeconds>6049.690000</TotalTimeSeconds>
<DistanceMeters>10347.431641</DistanceMeters>
<MaximumSpeed>6.847500</MaximumSpeed>
<Calories>1386</Calories>
<AverageHeartRateBpm xsi:type="HeartRateInBeatsPerMinute_t">
<Value>121</Value>
</AverageHeartRateBpm>
<MaximumHeartRateBpm xsi:type="HeartRateInBeatsPerMinute_t">
<Value>165</Value>
</MaximumHeartRateBpm>
<Intensity>Active</Intensity>
<Cadence>0</Cadence>
<TriggerMethod>Manual</TriggerMethod>
<Track>
<Trackpoint>
<Time>2008-01-26T18:29:27Z</Time>
<Position>
<LatitudeDegrees>47.297868</LatitudeDegrees>
<LongitudeDegrees>-121.287557</LongitudeDegrees>
</Position>
<AltitudeMeters>757.656250</AltitudeMeters>
<DistanceMeters>0.000000</DistanceMeters>
<HeartRateBpm xsi:type="HeartRateInBeatsPerMinute_t">
<Value>75</Value>
</HeartRateBpm>
<SensorState>Absent</SensorState>
</Trackpoint>
<Trackpoint>
...
</Track>
</Lap>
<Creator xsi:type="Device_t">
<Name>Forerunner305</Name>
<UnitId>3322440126</UnitId>
<ProductID>484</ProductID>
<Version>
<VersionMajor>2</VersionMajor>
<VersionMinor>40</VersionMinor>
<BuildMajor>0</BuildMajor>
<BuildMinor>0</BuildMinor>
</Version>
</Creator>
</Activity>

KML is Google’s file format for displaying geodata. Here is a snippet of a path that is overlaid on a map:

<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.2">
  <Document>
    <name>Paths</name>
    <description>Examples of paths. Note that the tessellate tag is by default
      set to 0. If you want to create tessellated lines, they must be authored
      (or edited) directly in KML.</description>
    <Style id="yellowLineGreenPoly">
      <LineStyle>
        <color>7f00ffff</color>
        <width>4</width>
      </LineStyle>
      <PolyStyle>
        <color>7f00ff00</color>
      </PolyStyle>
    </Style>
    <Placemark>
      <name>Absolute Extruded</name>
      <description>Transparent green wall with yellow outlines</description>
      <styleUrl>#yellowLineGreenPoly</styleUrl>
      <LineString>
        <extrude>1</extrude>
        <tessellate>1</tessellate>
        <altitudeMode>absolute</altitudeMode>
        <coordinates> -112.2550785337791,36.07954952145647,2357
          -112.2549277039738,36.08117083492122,2357
          -112.2552505069063,36.08260761307279,2357
          -112.2564540158376,36.08395660588506,2357
          -112.2580238976449,36.08511401044813,2357
          -112.2595218489022,36.08584355239394,2357
          -112.2608216347552,36.08612634548589,2357
          -112.262073428656,36.08626019085147,2357
          -112.2633204928495,36.08621519860091,2357
          -112.2644963846444,36.08627897945274,2357
          -112.2656969554589,36.08649599090644,2357
        </coordinates>
      </LineString>
    </Placemark>
  </Document>
</kml>

In order to display geodata, I need to convert the geo location specific part of TCX to KML. Fortunately, this guy had run into this issue before, and provided some XSLT to do the job here: http://www.oe-files.de/ge/tcx2kml.xsl. Thanks, Jorn, and sorry about the missing umlaut on your name, my codepage foo is not what it should be.
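To make the mapping between the two formats concrete, here is a rough Ruby sketch of the core of the conversion. It is not the XSLT linked above, just my own illustration: the helper name kml_coordinates is hypothetical, and it uses regexes rather than a real XML parser to stay short. Each TCX Trackpoint becomes one longitude,latitude,altitude triple, which is the order the KML coordinates element uses.

```ruby
# Hypothetical helper (not the linked XSLT): pull each Trackpoint's position
# out of TCX text and format it as a KML coordinate triple,
# i.e. longitude,latitude,altitude.
def kml_coordinates(tcx_text)
  tcx_text.scan(%r{<Trackpoint>(.*?)</Trackpoint>}m).map do |(body)|
    lat = body[%r{<LatitudeDegrees>(.*?)</LatitudeDegrees>}, 1]
    lon = body[%r{<LongitudeDegrees>(.*?)</LongitudeDegrees>}, 1]
    alt = body[%r{<AltitudeMeters>(.*?)</AltitudeMeters>}, 1]
    next if lat.nil? || lon.nil? # skip trackpoints that carry no position
    [lon, lat, alt || '0'].join(',')
  end.compact
end

# The trackpoint from the TCX snippet above.
trackpoint = <<~TCX
  <Trackpoint>
  <Time>2008-01-26T18:29:27Z</Time>
  <Position>
  <LatitudeDegrees>47.297868</LatitudeDegrees>
  <LongitudeDegrees>-121.287557</LongitudeDegrees>
  </Position>
  <AltitudeMeters>757.656250</AltitudeMeters>
  </Trackpoint>
TCX

puts kml_coordinates(trackpoint).join("\n")
```

For the sample trackpoint this prints -121.287557,47.297868,757.656250, which could be dropped between the coordinates tags of a LineString like the one in the KML snippet.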

Unfortunately, when I export data from my Mac-based Garmin Training Center, I get over a year’s worth of information. There is no way in this program to export a day, a week, or a month. So my first task is to break out this huge a** file into digestible chunks. I’m opting for breaking out by activity right now; maybe later I can break out by time.

I thought about the quickest way to do this; after all, I’m not in the mood to do anything laborious after putting the kids to bed. I’ve written SAX parsers before, and I’m way too lazy to keep around a bunch of state I need to refer to whenever I get a ‘tag encountered’ event. Plus, I had a sneaking suspicion that sed or something sed-like would do the job utilizing regex. One of my mentors used to tell me ‘Arun, you think you’re really smart and you go around inventing all of these rounder wheels. Why don’t you just take the time to read a couple of man pages?’ He went on to say that those man pages were written by much smarter people than he or I, which really used to piss me off :)

Turns out csplit does an admirable job of splitting out files based on context that matches a specific regex. There are a couple of ‘gotchas’.

(1) Put your regex in quotes, otherwise it will be interpreted by the command shell. This _really_ sucks when using XML tag syntax in your regex, i.e. an unquoted /<Activity Sport=.*>/ has its < and > interpreted as input and output redirection.

(2) csplit can execute at most 100 times; it creates files in xx00 – xx99 format by default. You can change the numbering scheme, but not the limit. For any XML file with > 100 sections of extractable XML, this poses a problem.

(3) if you don’t specify -k (keep written files on error), and you have < 100 files written out, all files written for that run will get erased.

My version of csplit that split out the chunks:
csplit -k -f act exercise.out.tcx '/<Activity Sport=.*>/' {100}

This seems like a great time to actually write some code (as opposed to writing a SAX parser): I need to drive csplit until there are no more <Activity> tags to individually extract. Ruby has become my scripting language of choice lately, primarily because I can maintain it over time, and also because of irb, the Ruby commandline shell, which allows me to ‘test drive’ commands I want to eventually put into a script.

csplit writes out the number of bytes in each created file to stdout, which we can take advantage of:

ret = `csplit -f act input.tcx '/<Activity Sport=.*>/' {100}`
puts a newline-delimited list of the byte sizes of the output files into ret. The files all start with ‘act’ and range from 00 to 99.

vals = ret.split

if(vals.length == 100)

allows us to see if we have more work to do, i.e. all 100 files (act00 through act99) have been created. We take the last file, act99, copy it to a new directory to start over, and repeat until vals.length < 100:


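# Assumes input_file names the exported TCX file in the current directory and
# gen_new_dir(count) returns a directory name for pass number count.
count = 0
newdir = Dir.pwd
more_work = true

while more_work
  # run csplit here.
  puts "splitting files by <Activity> tag in #{newdir}..."
  ret = `csplit -k -f act #{input_file} '/<Activity Sport=.*>/' '{100}'`
  vals = ret.split
  if vals.length == 100
    # All 100 files were written: take the last one, act99, and split it
    # again in a fresh directory on the next pass.
    count += 1
    newdir = "../#{gen_new_dir(count)}"
    puts "creating #{newdir}/#{input_file}"
    Dir.mkdir(newdir) unless File.exist?(newdir)
    `cp act99 #{newdir}/#{input_file}`
    Dir.chdir(newdir)
  else
    more_work = false
  end
end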
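For comparison, the same kind of context split can be sketched in plain Ruby, which sidesteps the 100 file ceiling at the cost of reading the whole file into memory. This is an alternative I'm sketching, not what the script above does; the helper name split_activities and the two-activity sample are made up for illustration.

```ruby
# Hypothetical helper: split TCX text into chunks that each begin at an
# <Activity Sport=...> tag. The first chunk is whatever precedes the first tag.
def split_activities(tcx_text)
  # The lookahead splits *before* each tag without consuming it.
  tcx_text.split(/(?=<Activity Sport=)/)
end

# Made-up sample with two activities.
sample = <<~TCX
  <Activities>
  <Activity Sport="Running">
  <Id>2008-01-26T18:29:26Z</Id>
  </Activity>
  <Activity Sport="Biking">
  <Id>2008-01-27T18:29:26Z</Id>
  </Activity>
  </Activities>
TCX

chunks = split_activities(sample)
chunks.each_with_index do |chunk, i|
  # Mirror csplit's habit of reporting the byte count of every piece.
  puts format('act%02d: %d bytes', i, chunk.bytesize)
end
```

As with csplit, the pieces are not standalone TCX documents: the first chunk holds the preamble and the last one carries the closing tags.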

What is left: take these files and see if the XSLT code above works with them or pukes — these are not standard TCX files anymore, so I’m not expecting much love. Also, extracting KML is only one part of what I want to do with these files — showing heart rate vs distance vs altitude, etc is also something that isn’t super well done in the existing freeware.


Why url mapping sucks in Java Servlet land, and what I did about it.

February 13, 2008

This is more of ‘notes to self’ (like anyone else actually reads this!) than anything else. I bounce around so much at my current job that I forget everything and have to figure it out again. One such example: Servlets. I’ve only written servlets when absolutely necessary, i.e. when I’ve had to prototype a service and didn’t really care about what paths were coming in, how the web app was deployed, etc. So I always go through a bit of a learning curve when working with Servlets, because I usually have forgotten everything I know about them.

I am working on converting a set of services that offer POX over HTTP (see this example) into something more RESTful. I’ll spare you the RESTafarian evangelism and just say that my life has become much easier once I started thinking of infinite resources constrained by a (very) finite set of verbs. As the number of brain cells I kill increases, I have had to put my remaining ones to work figuring out how to be as effective as I was back in the day, when I had brainpower to spare.

As part of that assignment, we have had to think about combining separate services into a single, meaningful, easy to grasp API. I will say that thinking in resources helps here, because it’s easy to have a resource Foo that has sub resources Bar, Star, and Var, and request those resources as /foo/bar, etc. But mapping that elegant and simple layer to a sub strata of what are basically RPC calls has taken some thought.

One thing we decided to do was to consolidate all services that currently reside in separate WARs into one web app. The original goal was to have this web app be a very simple shell, and let web.xml route messages to specific services. All was good, less code was to be written, and we were supposed to live happily ever after... except in order to map objects to messages, we would end up routing requests from path /foo/bar to servlet X and requests from /foo/bar/something to servlet Y. This is because unlike the happy world of self contained RESTful resources, our services actually provide different kinds of functionality for the same resources. But we really want to fake ‘resourceful’ ness.

The thing about web.xml servlet mapping is that it is limited to hierarchical path mapping, i.e.

map /foo/bar/star/* to servlet x

map /foo/glar/* to servlet y

map *.bat to servlet z

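web.xml patterns are limited to exact paths, path prefixes ending in /*, and extensions like *.bat. There is no way to put a wildcard in the middle of a path, so you cannot, say, send /foo/*/mar to servlet z while /foo/bar/star/* goes to servlet x. In short, you can’t mix and match path hierarchies to servlets.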

The solution, after not much time spent browsing the Servlet spec (good read, btw), is to use the built-in RequestDispatcher object obtained from the ServletContext. In web.xml, we mapped all of our service servlets to private paths that would never be called from the client API:


<servlet-mapping>
<servlet-name>ServiceX</servlet-name>
<url-pattern>/service_x/*</url-pattern>
</servlet-mapping>

<servlet-mapping>
<servlet-name>ServiceY</servlet-name>
<url-pattern>/service_y/*</url-pattern>
</servlet-mapping>

<servlet-mapping>
<servlet-name>ServiceZ</servlet-name>
<url-pattern>/service_z/*</url-pattern>
</servlet-mapping>

<servlet-mapping>
<servlet-name>Default</servlet-name>
<url-pattern>/</url-pattern>
</servlet-mapping>

Note that I’ve got a Default servlet catching all requests, because the paths above aren’t exposed in documentation (even if they are hit, they resolve to no ops). In the Default servlet init method, we created request dispatchers for all servlets that we had specified:


_servletXDispatcher = this.getServletContext().getRequestDispatcher("/service_x/*");
_serviceYDispatcher = this.getServletContext().getRequestDispatcher("/service_y/*");
_serviceZDispatcher = this.getServletContext().getRequestDispatcher("/service_z/*");

Note that in order to get valid request dispatchers, we had to specify the servlet mappings as specified in web.xml

Now we have RequestDispatchers, which can forward requests on to servlets:
_servletXDispatcher.include(httpRequest,httpResponse);

However I still needed a way to map partial paths to different request dispatchers. I ended up creating an ObjectMatcher class that regex matched incoming strings to specified objects:

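import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Maps regular expressions to arbitrary objects of type T.
public class ObjectMatcher<T> {

    private final Map<Pattern, T> _patternsMatchServlets;

    public ObjectMatcher() {
        _patternsMatchServlets = new HashMap<Pattern, T>();
    }

    // Compiles each regex key and stores it alongside its target object.
    public void load(Map<String, T> servletMap) {
        for (Map.Entry<String, T> entry : servletMap.entrySet()) {
            _patternsMatchServlets.put(Pattern.compile(entry.getKey()), entry.getValue());
        }
    }

    // Returns the object whose pattern is found in uriPattern, or null.
    // HashMap iteration order is unspecified, so if several patterns match,
    // which one wins is not guaranteed.
    public T match(String uriPattern) {
        for (Map.Entry<Pattern, T> entry : _patternsMatchServlets.entrySet()) {
            if (entry.getKey().matcher(uriPattern).find()) {
                return entry.getValue();
            }
        }
        return null;
    }
}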

The load method in this object takes a map of regex values to objects. It then compiles the regex values into pattern objects. The match method uses those regexes to match against inbound strings, and returns the appropriate object, or null if an object isn’t found.

I loaded this object with a map of regex values to objects as follows:

Map<String, RequestDispatcher> rdMap = new HashMap<String, RequestDispatcher>();

// TODO: put all new paths for client API here.
rdMap.put(".*/entities.*", _queryReqDispatcher);
rdMap.put(".*/media.*", _queryReqDispatcher);
rdMap.put(".*/actions.*", _queryReqDispatcher);
rdMap.put("person/.*", _entityReqDispatcher);
rdMap.put("popular/.*", _zgReqDispatcher);
rdMap.put("media/.*", _zgReqDispatcher);

_matcher.load(rdMap);

and called it from my default servlet doGet method like this:

public void doGet( HttpServletRequest request, HttpServletResponse response )
throws ServletException, IOException {
dispatch(request,response);

}
to get (fairly) pain free routing in a central location.
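The matching idea itself is small enough to try out in irb. Below is a rough Ruby sketch of it, not the Java class above: an ordered list of regex and target pairs where the first matching pattern wins. The symbols standing in for request dispatchers and the sample paths are made up for illustration.

```ruby
# Hypothetical Ruby counterpart to ObjectMatcher: an ordered list of
# [Regexp, target] pairs where the first matching pattern wins.
class ObjectMatcher
  def initialize
    @routes = []
  end

  # routes: a Hash of regex source strings to arbitrary target objects.
  # Ruby hashes keep insertion order, so routes are tried in the order given.
  def load(routes)
    routes.each { |source, target| @routes << [Regexp.new(source), target] }
  end

  # Returns the target of the first pattern found in path, or nil.
  def match(path)
    pair = @routes.find { |pattern, _target| pattern.match?(path) }
    pair && pair[1]
  end
end

matcher = ObjectMatcher.new
matcher.load(
  '.*/entities.*' => :query_dispatcher,
  'person/.*'     => :entity_dispatcher,
  'popular/.*'    => :zg_dispatcher
)

puts matcher.match('/api/entities/42').inspect
puts matcher.match('person/arun').inspect
puts matcher.match('/nothing/here').inspect
```

Keeping the routes in an ordered list makes it predictable which target is chosen when two patterns overlap, which is worth having once the route table grows.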


Is BDD the new TDD? Adventures with RSpec

February 5, 2008

At Evri, I have the privilege of working with people who make it their business to write software in the most productively lazy way possible. By that I mean they strenuously avoid making rounder wheels. So when I see one of them start to use a new technology, I can only conclude that the technology must be making their (coding) life easier.

The first time I heard about rspec was when Phil Hagelberg had a practice run of his RailsConf presentation ‘tightening the feedback loop’ at one of our brown bags. He mentioned rspec along with rcov and flog. While rcov and flog struck me as having immediate value, I wasn’t so convinced about rspec. After all, isn’t that what Test::Unit is for?

At the time I was head deep in some Java code and couldn’t quite get to trying out rspec. When I surfaced, I felt very complicated and was happy to dive back into Ruby. However I had still forgotten about rspec and was still doing ‘old school TDD’ until I noticed that Travis, a notoriously ‘lazy bastard’, had completely switched over to rspec.

So I tried it, still skeptical. The whole ‘BDD vs TDD’ thing still confuses me; it’s like two people arguing whether chartreuse is really yellow or green. The whole point is to specify your expectations first, right?

My skepticism quickly faded as I started to use rspec. The best thing I can say about rspec is that it makes writing tests first so much easier. I believe it’s due to the DSL. Using rspec let me focus on what I wanted my class to do in a way that felt much more natural than writing tests for specific failure conditions. Instead of saying

require 'test/unit'

class FooTest < Test::Unit::TestCase
  def test_valid_foo_returned
    class_under_test = ClassUnderTest.new
    foo = class_under_test.method1
    assert(foo != nil)
    # assert_equal takes the expected value first.
    assert_equal("Foo", foo.class.to_s)
  end
end

I would instead say

describe ClassUnderTest do
  it 'should return a valid object Foo from method1' do
    class_under_test = ClassUnderTest.new
    foo = class_under_test.method1
    foo.should_not eql(nil)
    foo.class.to_s.should eql("Foo")
  end
end

I think a lot of people would look at the two code snippets above and think ‘chartreuse’. I know that’s what I was thinking. So what is the big deal?

First: the DSL lets me express my expectations about how the class under test behaves, using should and should_not. What I found is that tests tend to write themselves, and then Red/Green testing enables me to write the smallest amount of code to get past each line. In the example above, the description ‘it should return a valid object Foo from method1’ allows me to stay clear about what I’m expecting from method1.

Compare that to the standard Test::Unit approach. The test that I wrote above does the same thing as the rspec code, but it doesn’t reinforce the fact that I’m testing a specific class and expecting specific behaviors. It may validate those behaviors, but I still have to go through a layer of translation, figuring out what each assertion really means, in order to understand the test.

That extra layer of translation makes Test::Unit start to feel heavier and slower because I’m still translating what I want to test into a test method, instead of having a method help me outline desired behavior and expectations. The extra layer is more energy I have to expend to use and maintain the test: energy that I could be using to write code, energy that I will not want to expend when I’m under deadline pressure.

I’m still playing around with rspec. I’m a newbie, and am still getting used to the way other users avoid fixtures, and to how and when to use mocks and stubs (and which is which), but so far I think it has gone a long way toward keeping my coding restricted to fulfilling expectations and nothing more.

So life is easier with little effort — this is something that makes me extremely happy. Again, I don’t know whether to call it BDD, or TDD, or Fred, but this approach is working for me and I don’t care to debate the nuances. That said, I will continue to educate myself about the nuances and hope that some kind of enlightenment occurs :)

I will continue to explore rspec and other tools that make coding/maintenance easier. Specifically, I’m curious about:

  1. whether a story is an analog of a Test::Unit::TestSuite
  2. how matchers work — when I need them, etc.
  3. when to use a mock — when do I know that a real object is too painful/expensive? I’m not sure it makes sense to mock the model layer b/c I get implicit model layer testing when I use it, and if the model layer changes, my tests will (appropriately) break.

Installing Wikipedia part 4 of N: getting additional wikipedia metadata

January 30, 2008

Continued from yesterday:

The loading of ‘page’ table data  finished after approx 16 hours on a 2.2GHz dual proc, 4GB machine with approx 6.5 million page records, along with the latest revision and text information (similar number of records). All other tables were blank — which is fine if you want to host wikipedia, but not fine if you want to gather inter-wiki page links and category metadata, which are stored in the pagelinks and categorylinks tables respectively.

Page link data is useful because it provides a basic graph of all wikipedia nodes. Category link data is useful because it provides decent classification information w/o the overhead of classification methods that rely on raw text.

Where to get the metadata:

Pagelink data is available at

http://download.wikimedia.org/enwiki/20080103/enwiki-20080103-pagelinks.sql.gz

Categorylink data is available at

http://download.wikimedia.org/enwiki/20080103/enwiki-20080103-categorylinks.sql.gz

    I used wget and gzip -d to get the raw SQL data.

    How to load the metadata:

    • mysql -u arun -parun wikipedia < enwiki-20080103-categorylinks.sql
    • mysql -u arun -parun wikipedia < enwiki-20080103-pagelinks.sql

    I ran into a problem when trying to load the categorylink sql: I received the following MySQL error:

    ERROR 1071 (42000) at line 12: Specified key was too long; max key length is 1000 bytes

    I googled around and found a thread that said the error was happening because the database was UTF-8, and that the fix was to switch it to Latin-1. When I did this from the mysql commandline client:

    alter database character set latin1;

    and reloaded the categorylink sql, it worked. Moral of the story: create your wikipedia database with latin-1 encoding. If you are going to insert into a UTF-8 database, you will need to convert from Latin-1 to UTF8. I’m using Ruby, so I’m going to use Iconv to convert into UTF-8 prior to inserting into my UTF-8 database.
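As a small sketch of that conversion step: on current Ruby versions Iconv is no longer part of the standard library, and String#encode does the same job. The sample string below is made up; it stands in for a Latin-1 value read out of the database.

```ruby
# A Latin-1 byte string as it might come out of the database: "Café",
# where é is the single byte 0xE9.
latin1 = "Caf\xE9".b.force_encoding(Encoding::ISO_8859_1)

# Transcode to UTF-8, where é becomes the two bytes 0xC3 0xA9.
utf8 = latin1.encode(Encoding::UTF_8)

puts utf8
puts utf8.bytes.inspect
```

The byte dump shows the single Latin-1 byte 233 turning into the UTF-8 pair 195, 169, which is what a UTF-8 database expects to receive.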

    One thing that happened when I tried to confirm the number of loaded category/page links in the db was that I received a strange internal MySQL error. I tried to restart the machine and got a ‘/var/lib/mysql: partition too full!’ message (and the database wouldn’t start up). I fixed this by deleting some data. If you can’t delete any data, try the steps suggested here.

    When I was able to restart the database, the pagelinks table was corrupted. I ran

    repair table pagelinks;

    from the mysql commandline client as described in the mysql documentation. The repair took about 3 hours.

    Wikipedia Download Summary:

    • Total Page Count: 6202531
    • Total Pagelink Count: 77444718
    • Total CategoryLink Count: 18912664

    Installing Wikipedia, part 3 of N, using ActiveRecord and RSpec to build the model layer

    January 30, 2008

    Continued from part 2 of N

    After working around the hidden commandline ordering of mwdumper and going home while the database was still being loaded, I walked in today to find 4.88 million pages and counting. No categorylinks, no pagelinks, etc. I’m hoping that those get populated before the job is complete, because aside from the page data, we’re using wikipedia categories to do a crude classification of entities, and pagelink data to construct a graph of pages and their connections.

    Based on the schema documentation, I’ve constructed an ActiveRecord based model of the tables that I’m interested in. Using ActiveRecord w/o Rails is pretty easy:

    (1) you need to explicitly connect to the database. ActiveRecord maintains a copy of the connection to use for all derived classes once this is done:

    ActiveRecord::Base.establish_connection(
    :adapter=>'mysql',
    :host=>'localhost',
    :database=>'wikipedia_test',
    :username=>'arun',
    :password=>'arun')

    (2) You can then subclass the ActiveRecord::Base class as usual:

    class Page < ActiveRecord::Base
    set_table_name 'page'
    set_primary_key 'page_id'
    end

    Pretty (yawn!) straightforward so far. So I decided to mix it up a little by writing my tests in rspec. I’ve been intrigued by Behavior Driven Development, aka BDD, and wanted to see how useful it would be in my day to day work. As I’ve said many times before, I’m a big fan of TDD. However I still find it an effort to write tests first. When I do, it feels great. But I often slip into a rut where the tests don’t drive my coding as much as they should, and as a result I write unnecessary code.

    Perhaps the biggest advantage of rspec over xUnit type unit testing is that I get to specify my object behavior prior to writing a single line of code for the object in something that looks / feels like English, which is something I’m fairly fluent in. Here is the starter rspec I wrote prior to actually coding the wikipedia model classes. I was treating those classes as a singular entity (aka ‘the model’) at this point:

    describe Page do

    it 'should retrieve a valid page' do
    end

    it 'should retrieve a valid text' do
    end

    it 'should retrieve a valid revision' do
    end

    it 'should retrieve the latest text associated with a page via the associated revision' do
    end

    it 'should retrieve all associated CategoryLinks for a Page' do
    end

    it 'should retrieve all associated PageLinks for a Page' do
    end

    end

    This allows me to really get my head around what the model should be capable of. Then, using the object extensions should and should_not, I can validate those assertions:

    it 'should retrieve the latest text associated with a page via the associated revision' do
      test_text = "test text for page_1"

      @text = Text.new
      @text.attributes = load_text_attribs(test_text)
      @text.save

      @page = Page.new
      @page.attributes = load_page_attribs('page_1')
      @page.save

      @revision = Revision.new
      @revision.attributes = load_revision_attribs(@page.page_id, @text.old_id)
      @revision.save

      latest_rev = @page.get_latest_revision
      latest_rev.should_not eql(nil)
      text = latest_rev.text
      text.should_not eql(nil)
      text.old_text.should eql(test_text)
    end

    In order to get ActiveRecord Models loading in RSpec, I did a couple of things:

    (1) I created a connection before running any tests. In rspec, the before(:all) method allows me to specify a block of code that runs prior to any test execution.

    before(:all) do
      ActiveRecord::Base.establish_connection(
        :adapter  => 'mysql',
        :host     => 'localhost',
        :database => 'wikipedia_test',
        :username => 'arun',
        :password => 'arun')
    end

    (2) I made sure that data was getting deleted prior to every test by using the before() method with the :each symbol:

    before(:each) do

    #clear out all data

    end

    One significant difference between RSpec and Test::Unit is the lack of fixture support. I was shoving in fixture support, aka putting a square peg into a round hole, when I decided to see how RSpec users felt about fixtures. Turns out they are not particularly fond of them. One way to load test data w/o fixtures is to include model-specific helper modules that build the attributes:

    module PageHelper

    def load_page_attribs(title)
    {
    :page_title=>title,
    :page_namespace=>2,
    :page_random=> 0,
    :page_touched=> 0,
    :page_latest=> 0,
    :page_len=> 200
    }
    end

    end

    # in a spec that includes PageHelper
    page = Page.new
    page.attributes = load_page_attribs('page_1')
    page.save

    In the end, even though it co-mingles data with code, which is commonly perceived as a bad thing to do (hence fixtures!), the less-code approach seems more manageable b/c I can make the modification directly in the spec file, and the use of Ruby symbols makes the code read as easily as a fixtures file.
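    One possible refinement of the helper approach, sketched in plain Ruby with no database involved: keep the default column values in one constant and let each spec override only the attributes it cares about. The PAGE_DEFAULTS constant and the overrides parameter are hypothetical additions to the PageHelper module shown earlier.

```ruby
# Hypothetical variation on PageHelper: defaults live in one place and each
# spec overrides only the columns it cares about.
module PageHelper
  PAGE_DEFAULTS = {
    :page_namespace => 2,
    :page_random    => 0,
    :page_touched   => 0,
    :page_latest    => 0,
    :page_len       => 200
  }

  # Returns a fresh attributes hash; values in `overrides` win over defaults.
  def load_page_attribs(title, overrides = {})
    PAGE_DEFAULTS.merge(:page_title => title).merge(overrides)
  end
end

include PageHelper

# A spec that only cares about page length changes just that one value.
short_page_attribs = load_page_attribs('page_1', :page_len => 42)
```

    Because Hash#merge returns a new hash, the defaults are never mutated, so one spec's overrides cannot leak into the next.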

    More wikipedia processing progress, same bat time, same bat channel!


    Installing Wikipedia, part 2 of N: getting the !@?! data.

    January 29, 2008

    Continued from Part 1 of N

    Where to start….first of all, the high level steps.

    1. Download the dump file
    2. convert it to a huge, uncompressed SQL file
    3. import that file into mysql

    Downloading the Dump File

    Potential (Dis)Qualifier: I’m doing this on a Debian box, assuming standard tools available on that distro. You can find the (English) file dumps at http://download.wikimedia.org/enwiki/; the subdirectories correspond to the dates of the dumps. There are several compressed (bz2) XML files, and several compressed (gzip) SQL files. Right now, I don’t know the degree of overlap between the two, but I have downloaded the latest pages-articles.xml.bz2, mainly because I didn’t want the extra user/discussion information found in pages-meta-current.xml.bz2.

    Here are the steps I took to download and extract the SQL that I then loaded into MySQL:

    wget http://download.wikimedia.org/enwiki/20080103/enwiki-20080103-pages-articles.xml.bz2

    Then I ran

    bzip2 -d enwiki-20080103-pages-articles.xml.bz2

    to uncompress to a huge (14GB) xml file.

    How to Convert that Bad Boy into SQL

    There are many ways to do this, detailed on this page. One thing I hate is 10 different ways to do the same thing, so I did some research and found that most people have had success using mwdumper. I downloaded the latest mwdumper.jar from http://download.wikimedia.org/tools/mwdumper.jar, and ran it as follows:

    java -jar mwdumper.jar enwiki-20080103-pages-articles.xml --format=sql:1.5 > dump.sql

    Note the order of the parameters. It is important to list the file you are dumping from prior to the format. When I didn’t do this, I received a bizarre SQL insertion error:

    ERROR 1064 (42000) at line 1: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ‘ …

    Loading into MySQL

    The mwdumper command above can also be piped straight into MySQL. I tried this at first, but ran into the out-of-order parameter problem and fell back to creating a real SQL file before I figured out the parameter order issue. I think separating XML->SQL conversion from SQL insertion is a good idea in general, because I haven’t gotten all of the SQL in cleanly yet, and having that file lets me make multiple passes.

    There are a couple of changes to the mysql config file that have improved the speed of the upload: the file is located at /etc/mysql/my.cnf

    (1) Increase the size of the InnoDB log file and log buffer. The log file defaults to 5MB, which means lots of disk I/O. Here are the settings I used; note that these are conservative because I don’t have exclusive use of all 4GB of my machine’s RAM.

    innodb_log_file_size=20M
    innodb_log_buffer_size=8M

    (2) Turn off log_bin. Logging transactions is a great way to make sure you don’t lose data, but I’m uploading to a single machine and don’t need the overhead. I comment out default settings as a general rule:

    #log_bin = /var/log/mysql/mysql-bin.log
    # WARNING: Using expire_logs_days without bin_log crashes the server! See README.Debian!
    #expire_logs_days = 10
    #max_binlog_size = 100M

    After making these changes, I restarted mysql:

    /etc/init.d/mysql restart

    Then I created a UTF-8 encoded database to download into using the mysql commandline:

    mysql -u username -ppassword (note that -p only takes a password if you don’t leave a space)

    Once in the mysql commandline client:

    create database wikipedia CHARACTER SET utf8; (make sure you put the semicolon on the end)

    Now leave the client (\q), because we need to load the database with a schema. Otherwise the SQL load won’t work: it’s just a table insert script that assumes the tables already exist. Download the latest schema from

    http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/maintenance/tables.sql?revision=29128

    This was the head revision at the time I wrote this; visit the MediaWiki ViewVC page to get the latest version. This MediaWiki page provides a great overview of the schema.

    You can insert the schema into the database by doing:

    mysql -u username -ppassword wikipedia < tables.sql

    Now, with the wikipedia database schema set up, you can insert data:

    mysql -u username -ppassword wikipedia < dump.sql (assuming dump.sql is the name of the sql file you generated using MWDumper)

    Now it’s time to step away from the machine — I ran this job nohup so I could get on with my life. It’s still running, and I’ll update the blog with the results in my next wikipedia related post.


    Installing Wikipedia, part 1 of N

    January 28, 2008

    One of my mandates at work is to build up a huge repository of information taken from free and not free, structured and unstructured data sources. Naturally we targeted Wikipedia as the best seed source to get us started. After writing some very specialized table scrapers in Ruby, we decided to take another approach and just ingest the entire Wikipedia database.

    Wikipedia has a “don’t scrape our pages, just download our file dumps” policy, which makes complete sense. Because of this (and the IP blacklisting that occurs when you scrape too fast/too much), our goal is to harvest information from an internal Wikipedia site.

    Wikipedia provides a set of tools and instructions for installing a local copy of Wikipedia, starting with loading the dump into MySQL and then installing and running MediaWiki, the software that turns the contents of the database into displayable, editable wiki content.

    That is the theory, at least. Here is what I’ve found while trying to follow these instructions. Note that some if not most of the drama below is self-manufactured, but hopefully someone, somewhere will find this information useful.

    1. There is too much information — i.e. six different ways to do the same thing — in some places, and not enough — i.e. what does an actual XML dump contain vs the SQL dumps — in others.
    2. As with most batch processes, error handling and recovery are painful. I am going to share my ‘lessons learned’ wrt wikipedia tools in a future post.
    3. There is no mapping of SQL to XML files in the download dirs. This may only be a problem because I was unable to successfully translate all of the XML into SQL, but it would still be nice to see that pages.sql contains text and revisions (because it’s not obvious that the other SQL files contain any of that information).
    4. There are plenty of ways to go off into the weeds. For example, I just spent half an hour debugging why my SQL upload failed at 68K entries. It turns out that the mwdumper progress output was getting folded into the SQL file because I was running the job under nohup. For now I’m going to run it as a console job, and investigate how to keep stdout and stderr separate under nohup before I import the file into the database.
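    On the nohup problem in item 4: when stderr is still attached to a terminal, nohup sends it to the same place as stdout, so anything a tool prints on stderr lands in the redirected output file. Assuming mwdumper's progress messages go to stderr, giving each stream its own file avoids the mix-up. The sketch below shows the idea in Ruby with Process.spawn; the FAKE_DUMPER command is a stand-in for the real converter, and the run_split name and file names are hypothetical.

```ruby
require 'tmpdir'

# Stand-in for the real converter: writes one SQL line to stdout and one
# progress line to stderr. The command itself is hypothetical.
FAKE_DUMPER = ['sh', '-c',
               'echo "INSERT INTO page VALUES (1);"; echo "1,000 pages" >&2']

# Run a command with stdout and stderr written to two different files, so
# progress chatter can never end up inside the SQL file.
# Returns true when the command exits successfully.
def run_split(command, sql_path, log_path)
  pid = Process.spawn(*command,
                      :out => [sql_path, 'w'],
                      :err => [log_path, 'w'])
  Process.wait(pid)
  $?.success?
end

# Demonstration in a throwaway directory.
Dir.mktmpdir do |dir|
  sql_path = File.join(dir, 'dump.sql')
  log_path = File.join(dir, 'progress.log')
  run_split(FAKE_DUMPER, sql_path, log_path)
  puts File.read(sql_path) # only the SQL line
  puts File.read(log_path) # only the progress line
end
```

    The same separation can be had from the shell by redirecting stderr explicitly (for example with 2> progress.log) on the nohup command line.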

    In addition, I’m using ActiveRecord to access and manipulate the data once it (eventually) gets into the database. ActiveRecord w/o Rails merits some discussion, primarily because I’m also using it with RSpec, which I believe is a much more natural way to validate code.

      I’m going to update my daily progress here so that I don’t have to go through this again. Stay tuned!


      So much to do, so little time…

      January 26, 2008

      I’m in (geek) lust after reading over the sample chapter of Making Things Work, a book that details how to DIY a set of interconnected devices that you get to make from scratch. This book touches on the three things that made me fall in love with computers from the first time I touched a mainframe at age 10:

      (1) designing really fun devices (both virtual and physical)

      (2) making them talk to one another

      (3) creating the detailed interactions (and watching the ensuing hilarity).

      Since I haven’t even started to make progress on my exercise tracking/mapping/etc website, I’m going to have to hold off on this one, maybe put it out there as the carrot that will get me through the current home project, which I can only work on when my real job isn’t demanding all of my time, and right now that isn’t the case :(