  • I learned that the six-wheel Formula One car is not dead. I think it would be interesting to see a complex drivetrain in the form of a six-wheel car … inline duals in the back, and a single axle for steering in the front. It would be a menace to turn without some serious computer assistance (I’m pretty sure mechanical differentials can’t solve all of the problems), and I don’t have a clue whether it could ever perform outside a straight line without some form of high-speed all-wheel steering. But the launch traction would be incredible, especially if you bias the rearmost tires so they ride a little higher on the suspension and stay out of the way until you plant the accelerator and shift the weight balance onto them…
  • I learned that even in Connecticut, going from 50 to below freezing in a short amount of time isn’t impossible. I’ve not completely lost the Midwest.
  • Quicksilver is even cooler than I imagined. If you’re not using it, you should be. Of course, that means you need a Mac. If you don’t have one, you should. I’m a convert for life. Down with Windows. Linux on servers. Mac forever!
  • Vim can save your life, even when the power goes out. Swap files forever! Ok, so it didn’t save my life, but when the power went out, I was just like “eh, that’s ok.” Everybody else was yelling and groaning. I saw it as a mandatory coffee break.
  • You can’t teach a developer anything if they don’t have the fundamentals. This probably isn’t true, but I’m bitter. This guy is trying to work with locking constructs … it’s abstracted out a ways, but at the end of the day there’s a pool of processes, a pool of tasks, and job control done via files on a network share. Safe mutual exclusion is hard in this scenario, but it’s not impossible. Right now we’re seeing 1–5% performance attrition in our system from duplicated work, because things aren’t mutually excluding. I pointed out that locking the lock file, rather than just creating one and trapping an exception when somebody tries to create the same file again, is probably a good first step in refactoring the approach to this problem. So, after much complaining on my part, an explanation of the idea of a TSL (test-and-set lock), and an explanation of the problem, the new approach was proposed:
  1. Create a lock file as a filestream.
  2. Write process ID and machine name into the file.
  3. Open file for reading.
  4. Lock file.
  5. Verify contents are as expected.
  6. Unlock file.
  7. Close file.
  8. Do work.
  9. If contents were not as expected, try another unit of work.
  10. If an exception was thrown when we tried to create the filestream, try another unit of work.
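
For concreteness, here’s roughly what that sequence looks like in code. This is a hypothetical Python sketch (the real system is .NET filestreams on a network share; `lock_path`, `do_work`, and the POSIX `fcntl` locks are my stand-ins), with the race window marked:

```python
import fcntl  # POSIX advisory locks as a stand-in for .NET FileStream locks
import os
import socket


def try_claim(lock_path, do_work):
    token = f"{os.getpid()}@{socket.gethostname()}"
    try:
        # Steps 1-2: create the lock file and write our identity into it.
        with open(lock_path, "x") as f:  # "x" raises FileExistsError if present
            f.write(token)
    except FileExistsError:
        return False  # step 10: someone else got there first

    # Steps 3-7: reopen, lock, verify, unlock. Note the lock is released
    # *before* the work happens, so it protects only the read below.
    with open(lock_path) as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # step 4 (blocking, as implemented)
        ok = f.read() == token         # step 5
        fcntl.flock(f, fcntl.LOCK_UN)  # step 6
    # Race window: from here on, nothing stops another process from
    # clobbering the file; we hold no lock at all.

    if not ok:
        return False  # step 9: lost the race, find other work
    do_work()         # step 8, performed with no lock held
    return True
```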

Yeah, so what did we solve? Well, for reference, here was the old approach:

  1. Create a lock file as a filestream.
  2. Try a different unit of work if filestream creation tosses an exception.

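In code, the old approach was essentially nothing more than an atomic exclusive create (again a hypothetical Python sketch; on .NET the equivalent would be opening with `FileMode.CreateNew`):

```python
import os


def try_claim(lock_path, do_work):
    try:
        # O_CREAT | O_EXCL makes "create only if absent" a single atomic
        # operation: exactly one process can win.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
    except FileExistsError:
        return False  # lock exists: go try a different unit of work
    do_work()
    return True
```
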
So we may have improved the statistical likelihood of this process succeeding, but we created the following problems:

  1. The unit of work may actually not get processed now, if filesystem modifications are not synchronous (we’re dealing with a globally distributed application; some of these crazy scenarios are more likely than you would think).
  2. Nothing prevents a delayed write command from clobbering the data outside of the 3 milliseconds in which the lock is held.
  3. Nothing prevents the open from succeeding for two separate processes and the lock succeeding for one of them while failing for the other, with nothing in place to handle that circumstance.
  4. The way the lock was implemented was blocking (non-blocking was an option), so now we manage to hang the other process until the work is done.

So, we gave ourselves a gambling shot at improving reliability and introduced four significant new problems. I should have known; I felt bad for demanding to see the code for how this was implemented, until I saw what was done.
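
For the record, the “lock the lock file” first step I was lobbying for holds one exclusive, non-blocking lock across the entire unit of work, which at least addresses the last three problems. Here’s a hypothetical POSIX-flavored Python sketch (`fcntl` advisory locks on a network share are their own can of worms, but the shape is what matters):

```python
import fcntl


def try_claim(lock_path, do_work):
    f = open(lock_path, "a")  # create if absent, never truncate
    try:
        # Non-blocking: if someone else holds the lock, bail out
        # immediately and go find a different unit of work.
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        f.close()
        return False
    try:
        do_work()  # the lock is held for the entire critical section
        return True
    finally:
        fcntl.flock(f, fcntl.LOCK_UN)
        f.close()
```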

As I said, this is a hard problem to solve. The best way to approach it is to bring some people together to talk about the problem and come up with a novel, tested, and robust solution, with everybody working to find holes in the approach. Ok, the actual best way to approach it is to not synchronize a large, globally distributed system through a filesystem, a daily data flux of 30GB of XML, and a host of Microsoft technologies just because it sounds sexy, and instead to use something built for fast, controllable data access … like a database.
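
To make that concrete: with a database, claiming a unit of work collapses to one atomic UPDATE, and the losers just move on. A hypothetical sketch using Python’s built-in sqlite3 (the `tasks` table and `claimed_by` column are made up; any real RDBMS behaves the same way):

```python
import sqlite3


def claim_one_task(conn, worker_id):
    # Returns the id of a task we now exclusively own, or None if no
    # unclaimed work remains.
    while True:
        row = conn.execute(
            "SELECT id FROM tasks WHERE claimed_by IS NULL LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The WHERE clause is the entire mutual-exclusion story: at most
        # one worker's UPDATE can match the still-unclaimed row.
        cur = conn.execute(
            "UPDATE tasks SET claimed_by = ? "
            "WHERE id = ? AND claimed_by IS NULL",
            (worker_id, row[0]),
        )
        conn.commit()
        if cur.rowcount == 1:
            return row[0]  # we won the race; go do the work
        # Another worker claimed it between our SELECT and UPDATE; retry.
```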

I’m not confident I could solve the problem correctly on my first attempt, despite all of my banter and criticism from greener pastures. But, with all of that said, I paid attention in my schoolwork, did well in my operating systems class, and understand the ideas behind synchronization. Either you get that, it makes sense, and you can solve these problems, or you don’t get it, and you keep finding creative ways to make things fail differently without ever resolving the issue. This is why I believe one needs either gobs of intuition, insight, and experience, or a solid foundation in the theory behind computer science (and I’m not saying graduate-level here, just a good undergraduate program), before being allowed to do software development on something tricky, like a distributed system.