Sunday, October 17, 2010

DIGGing a hole with Cassandra

Baseball players know to never make the first or last out of an inning at third base. Punt returners shouldn’t field the ball inside the ten-yard line.  Mathematicians can’t divide by zero. Rap songs typically contain sixteen bars per verse. Screenplays have three clearly defined acts with the inciting incident occurring around page twelve. These rules, written or unwritten, are the byproducts of years of experience, passed down from those who came before us.  Actually, these aren’t really “rules” per se.  “Fight Club” had rules - I’ve just violated the top two.  So yeah, the aforementioned aren’t exactly rules – they are “Best Practices”.

A Google search of “Computer programming best practices” returns 17.6 million results.  Code duplication, not sanitizing user input, writing code in Visual Basic – these are all big no-nos.  I kid Visual Basic, I kid…well not really. However, even with new methodologies like Test-Driven Development, and with endless information on the internets from developer blogs or Q&A sites like StackOverflow, you and your team still have every opportunity to throw caution to the wind, ignore the 17.6 million results and do your own thing.

I present to you Digg version 4.  

The Digg v4 team committed a mortal sin. It’s the sin that tempts every single developer who has ever written a line of code. This tempting morsel of awfulness hangs from the tree of software engineering begging you to take a bite. It’s the one thing every developer wants to do to an existing codebase but shouldn't. It’s the “there but for the grace of god go I” moment for many of us. It’s the one surefire way to completely screw the competitive advantage of your company and it’s the biggest mistake you can make as a development team:

Rewriting everything from scratch.

On March 7, 2010, nearly five months before the infamous v4 launch, Digg’s VP of Engineering, John Quinn, wrote a blog post about moving away from SQL and toward the “NoSQL” approach.  The NoSQL system he’s referring to is Cassandra.  I’ll explain later.

He started his post with the following statement:

“The last six months have been exciting for Digg's engineering team. We're working on a soup-to-nuts rewrite. Not only are we rewriting all our application code, but we're also rolling out a new client and server architecture. And if that doesn't sound like a big enough challenge, we're replacing most of our infrastructure components and moving away from LAMP.”

John Quinn is no longer employed by Digg. 

As Joel Spolsky puts it, code doesn’t get rusty.  It doesn’t go stale.  New code is inherently worse than old code because old code has been thoroughly tested in the real world. Yes, Digg struggled with scaling issues, but you don’t tear down the house and start from scratch.  You don’t do a “soup-to-nuts” rewrite on a code base which has garnered your company over $50 million in funding and employed around 100 people at one point. You definitely don’t bet your entire career and the future of a company on Cassandra, an untested open-source distributed database system.

Cassandra was developed by Facebook, a site which knows something about scaling issues. Facebook uses Cassandra to power their Inbox Search feature; it does not handle the bulk of their database activity.  Twitter uses it as well, but not for tweets. What should that tell you?

Digg launched and Digg crashed.  A lot.  Like really bad.  When the coolest part of your launch is the nifty error page you’ve concocted, something went horribly wrong.  They couldn’t roll back to their previous working state because the entire architecture of the site had changed.  And while they devoted all their resources to just stabilizing the site, they couldn’t respond fast enough to their power users' outcries, and Reddit ate their lunch.

Digg traffic is down 26% since the v4 launch.
Making the daily Top 10 list now only requires 200 diggs.
Again, John Quinn - no longer employed by Digg.

So what should Digg have done? If not rewrite everything, then what?  If they were having serious architecture problems, then refactor the code; don’t rewrite it all.  You make small changes over time instead of one gigantic update that changes your entire code base.  Perhaps switch the “Upcoming Stories” section to use Cassandra and see how she handles the load.  When you make small changes like this, the risk of any single release drops, and your team can afford bolder experiments because rolling back is much easier.
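
Here's a minimal sketch, in Python, of what that kind of small change could look like. To be clear, everything in it is hypothetical: the class names, the `rollout_percent` knob and the stand-in stores are mine for illustration, and none of it reflects Digg's actual code. The idea is just that one section of the site sends a slice of its traffic to the new backend while the old code path stays alive underneath it.

```python
import zlib


class LegacyStore:
    """Stand-in for the existing, proven code path (hypothetical)."""

    def upcoming_stories(self, user_id):
        return ["legacy story for user %d" % user_id]


class NewStore:
    """Stand-in for the experimental backend, e.g. a Cassandra-backed path (hypothetical)."""

    def upcoming_stories(self, user_id):
        return ["new story for user %d" % user_id]


class UpcomingStoriesRouter:
    """Sends a configurable percentage of users to the new backend."""

    def __init__(self, legacy, new, rollout_percent=0):
        self.legacy = legacy
        self.new = new
        self.rollout_percent = rollout_percent  # 0 means everyone stays on legacy

    def _in_rollout(self, user_id):
        # A stable hash keeps each user in the same bucket between requests.
        bucket = zlib.crc32(str(user_id).encode("utf-8")) % 100
        return bucket < self.rollout_percent

    def upcoming_stories(self, user_id):
        if self._in_rollout(user_id):
            try:
                return self.new.upcoming_stories(user_id)
            except Exception:
                # If the new path fails, quietly fall back to the old one.
                pass
        return self.legacy.upcoming_stories(user_id)


router = UpcomingStoriesRouter(LegacyStore(), NewStore(), rollout_percent=5)
stories = router.upcoming_stories(42)
```

Under these assumptions, rolling back is a one-line change: set `rollout_percent` to 0 and every request is served by the old code again. Nobody has to rebuild the house at 3 a.m.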

Speaking of changes, what version is Facebook on?

Exactly.

They don’t do “soup-to-nuts” rewrites.  They don’t make gigantic changes all at once. They just roll out updates.  Users usually don’t notice them until they suddenly can’t find the "Poke" button anymore. Though that’s probably a good thing.

I like Digg and I’m sure John Quinn is a good programmer.  I also liked Netscape.  Remember them?  They decided to rewrite everything at one point, too.  What version is Netscape on?

Oh yeah.