I’ve been working on a little page scraping project, nothing to write home about, nothing as hip and cool as MapReduce, but still a good learning experience. Today I learned a couple of things that I already thought I knew:
(1) Always pick the right tools for the job.
I was trying to extract chunks of text from an HTML page where the text spanned several DOM nodes that could sit on different branches of an arbitrary DOM tree. The chunks were delimited by known words; that was my only real hint. Because I had been using Hpricot to get down to the nodes, I kept using it to try to pick out the nodes I wanted to scrape text from. This was really, really painful. Basically I was writing an algorithm to traverse part of a tree, track the depth in the tree, and grab text under a set of nodes that were at the same level but on different tree branches. I was so wrapped up in getting this algorithm correct that I ignored the little voice inside my head that kept saying “this is too hard…”. Finally I grabbed Phil and said “I know I’m working too hard at this. The algorithm is correct but it’s a complete clusterf*ck to maintain. Got any ideas?”
Phil looked at it for about 5 seconds and said “Why don’t you just use a regex to split the text by the known words?” Duh. A regex is a much better way to get text that cuts across the DOM, especially when I have known markers separating the text. Where my original algorithm took 30 painful minutes to code, the regex call took about 5, and it worked the first time. Which brings me to my second lesson of the day:
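For the curious, the regex approach can be sketched roughly like this. This is not my actual code: the sample text, the marker words, and the split_by_markers helper are all made up for illustration, and it assumes the page text has already been flattened into a single string.

```ruby
# Hypothetical flattened page text; the marker words are invented for this sketch.
text = "Ingredients flour, sugar, eggs Directions mix and bake Notes serves four"

# Assumed delimiter words, matched as whole words. The capture group matters:
# String#split keeps captured delimiters in the resulting array.
markers = /\b(Ingredients|Directions|Notes)\b/

# Hypothetical helper: returns a hash of marker word => text that follows it.
def split_by_markers(text, markers)
  parts = text.split(markers)
  parts.shift # drop whatever came before the first marker
  # Pair each marker with the chunk after it and trim the whitespace.
  parts.each_slice(2).map { |marker, chunk| [marker, chunk.to_s.strip] }.to_h
end

chunks = split_by_markers(text, markers)
```

Here chunks["Directions"] comes back as "mix and bake", no matter how many DOM nodes that text was originally spread across.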
(2) Less is more
If I code only what is absolutely necessary, and no more, my code is by definition elegant. I find that TDD helps me stay away from building code cathedrals. Some people do this naturally. I find that it takes a lot of discipline to not overdesign. It also takes a village:
(3) A second (and sometimes third) pair of eyes saves hours
I used to believe that coding was a solitary, metaphysical exercise, and I was like some kind of enlightened being dispensing my wisdom in elegant logic capsules. My code as a result was ornate, baroque, and buggy. I firmly believe that the best code I have written in my life was the code that received frequent reviews, or at least code that was written in close collaboration with others. A lot of pretty screwed up design can be avoided just by having someone to talk to. Later on in the day I was trying to track down a bizarre error centered around a module variable that the interpreter said was uninitialized. I stared at the screen for a while, then actually listened to the little voice inside my head that said “go ask someone else to take a look”. Alex came along and in short order we found a place where I had a circular ‘require’ statement, i.e. I required class_1 from class_2, and required class_2 from class_1. The issue never manifested in unit testing because class_2 and class_1 were tested in isolation. But in a unit test of class_3, which required both class_2 and class_1, the bizarre error showed up. Just having someone there to bounce ideas off of accelerated the debugging process and got me unstuck.
I am currently in rapture with the ActiveSupport gem, especially the CoreExtensions::NumericSupport::Time class, because it saved a ton of time for me today. I had to write a quick script to allow someone from another team to request a resource from our web service and save it to a file. The download part was two lines of code, but they needed to request that the resource be generated from all new content that had been ingested X minutes/hours/days ago. No sweat, right? Not so fast... how about those boundary conditions? For instance, if I request a file from 3 minutes ago and it’s 12:01 AM, that flips the hour and the day back.
Yes, there is a very elegant recursive way to do this, and I will code it out when I am resting on the beach after a killer dawn patrol, but like they say, “time is money”, and today I had neither. Taking heed of the lessons I had already learned in the day, I asked around, and was pointed to the ActiveSupport documentation on http://gotapi.com. The Time class extends Fixnum with methods like minutes(), seconds(), etc. that return Fixnums. It also extends Fixnum with the ago() method, which allows you to write code like this:
y = 10.minutes.ago
Talk about coding by intention!
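If you don’t have ActiveSupport handy, the same idea can be sketched in plain Ruby, since subtracting a number from a Time subtracts that many seconds and takes care of the hour and day rollover for you. The time_ago helper and the sample date below are hypothetical, just enough to show the 12:01 AM boundary case; this is not how ActiveSupport implements ago().

```ruby
# Hypothetical helper: returns the Time that was `amount` units before `now`.
# Time#- with a numeric argument subtracts seconds, so crossing midnight
# (or a month or year boundary) needs no special handling.
def time_ago(amount, unit, now = Time.now)
  seconds_per = { seconds: 1, minutes: 60, hours: 3600, days: 86_400 }
  now - amount * seconds_per.fetch(unit)
end

# Made-up boundary case: 12:01 AM on New Year's Day, in UTC to keep it simple.
just_after_midnight = Time.utc(2008, 1, 1, 0, 1, 0)
three_minutes_back = time_ago(3, :minutes, just_after_midnight)
```

Here three_minutes_back lands on 23:58 on December 31, 2007, so the hour, day, month, and year all flip back without any extra code.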