The cost of a test

On Friday, I had an entire weekend in front of me. Nothing on the calendar other than the IWST picnic and some time to catch up on my Robert Jordan reading. (Quick aside, thanks to John McConda for organizing and hosting the IWST picnic!) By Sunday night, I had lost almost my entire weekend to isolating an issue on my desktop. Here are some things I learned...

(This is a slight variation of the truth. The real story is much more painful, but I've glossed over most of the issues that came out of updating the BIOS.)

First, the issue
A few months ago I built a custom desktop running 32-bit Vista. I didn't know I was installing 32-bit Vista until I got the disks home and realized that the 64-bit and 32-bit disks are packaged separately (of course - it's so hard to put that second disk in there). With a sigh, I put everything together and got the system up and running, tested it with Call of Duty 4 (a fine performance test for any system in my opinion) and gave it the new-computer seal of approval. Life was good...

The only problem with running the 32-bit version of the OS was that I couldn't use all 8 gigs of RAM I had installed. So after a couple of weeks, I sent away for the 64-bit version of Vista and when it arrived I happily installed it. Then the troubles began... About every 20 minutes, if anything intensive was happening, the system would crash.

The usual suspects
Now, this isn't my first custom system, so I had some ideas for where to start. It's a 64-bit operating system and driver support for 64-bit systems isn't that great. The first thing on my list was to make sure all the drivers were "officially" supporting 64-bit Vista. That's no small task by itself.

Second, I suspected that there might be a problem with the second 4 gigs of RAM. They had been installed while things were running smoothly, but they hadn't actually been used. So if there was a problem with one of the new chips, then I might not have seen it until now when they were finally being utilized. I figured this would be a low risk, since the chips were only a couple months old.

Third, I suspected the power supply. While I've never had one go bad on me before, I've read that the effects of an underpowered system can look just like the effects of bad memory (system freezes, etc.). I didn't think this would be the case either, because there were no real changes from a hardware perspective, and I didn't think that changing drivers and the OS could account for a significant swing in power usage.

The cost of a test
Testing is expensive. The most expensive part of figuring this out was my time. Looking up driver information for each piece of hardware in your system takes a huge amount of time. I spent at least eight hours researching, downloading drivers, and installing them. Each one, of course, required a reboot. I got a chance to get caught up on my magazine reading while doing that research. (Side note for anyone thinking about updating their BIOS: don't.)

After the costs related to my time (and note that it's not just the direct cost of my time, but also the lost opportunity cost of weekend time with my wife that I'll now have to pay back over the next month), there were costs to test the memory and power supply. The best way to test if hardware is the culprit is to swap it out for different hardware. Sure, there are memory testing programs, and I tried a couple, but a quick Google search will tell you that those are less than reliable, and buying new hardware for the test is probably less expensive overall (if you value your time) and more reliable.

So, I made a trip to the local electronics store to buy a larger power supply and some new memory chips. Initially, I figured I could return what I didn't end up using. I would start with testing the memory, and then try the power supply if that didn't get things working. Total cost upon leaving the electronics store: $250 and another hour of my time.

However, when I got home I changed my mind. While I had initially thought about testing one factor at a time, I decided that I didn't really care which aspect of the system was failing. I just wanted it fixed. So once the system was powered down, I replaced both the RAM (guessing at which two chips might have the problem) and the power supply (adding another 100 watts).

I changed my mind because I still wasn't sure that either of these fixes would solve the problem, and I wanted to leave time for additional testing if it were required. So while testing many factors at a time was less precise and incurred a greater hardware cost, it gave me the possibility of a lower cost in my total time commitment to the project.
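
For anyone who likes to see that trade-off as arithmetic, here is a rough sketch in Python. Only the $250 hardware total comes from the story above. The hourly rate, the hours per swap-and-test cycle, and the even $125 split between the RAM and the power supply are invented placeholders, so treat the result as an illustration and not as what actually happened.

```python
# Rough cost comparison: one factor at a time vs. both factors at once.
# Every constant here is a hypothetical placeholder except the $250 total.
HOURLY_RATE = 50.0       # assumed dollar value of one hour of my time
HOURS_PER_CYCLE = 3.0    # assumed time to swap parts, reboot, and stress-test once


def total_cost(hardware_kept, test_cycles):
    """Hardware dollars kept, plus the dollar value of time spent testing."""
    return hardware_kept + test_cycles * HOURS_PER_CYCLE * HOURLY_RATE


# One factor at a time, worst case: the first swap doesn't fix it, so I need
# two cycles, but I return the unused part and keep only $125 of hardware
# (assumed even split of the $250).
one_at_a_time = total_cost(hardware_kept=125.0, test_cycles=2)

# Both factors at once: a single cycle, but I keep all $250 of hardware.
both_at_once = total_cost(hardware_kept=250.0, test_cycles=1)

print(f"one at a time (worst case): ${one_at_a_time:.2f}")
print(f"both at once:               ${both_at_once:.2f}")
```

With these made-up numbers, swapping both parts at once comes out slightly cheaper in the worst case. Lower the hourly rate or shorten the test cycle enough and the answer flips, which is really the point: the right strategy depends on what your time is worth.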

Once things came back up, I was able to do some testing and determined that I had most likely fixed the problem. On a side note, I had one of my best rounds in Call of Duty 4 while testing... My wife wasn't as impressed as I was, but that's a different cost altogether.

How does this parallel my work?
All of this illustrated some simple principles that I use at work:

  • In general, the greatest cost in testing is the human cost. Software and hardware are cheap by comparison.

  • Sometimes you can't trust the results of an automated test. Automated tests can be unreliable, and they often miss things a human won't.

  • With each test we run, there is some opportunity cost we incur because we aren't doing something else.

  • While testing one factor at a time is often more precise, testing many factors at a time can be a faster way to gather information if you don't care about that precision.

At this point it appears I have a working system again. I'm not sure it was worth all the effort since I have a couple of fairly nice laptops, but for most things around the house it's the system of record. It also supports my late-night gaming addiction. Perhaps I'll get a chance to do another "CoD4 endurance test" tonight.