Tuesday's Heroku outage post-mortem (heroku.com)
69 points by bscofield on Oct 27, 2010 | hide | past | favorite | 41 comments


I think it's fascinating that a single engineer, who they already had on staff, was able to write a patch in one night that improved the performance of their messaging system 5x...

... and hadn't already done it.

This isn't meant as a slight against Heroku at all. They've got an incredible team of engineers. But imagine if Ricardo had said "hey, I could write a patch today that would speed up our messaging system 5x, should I do it?" the rest of the team would've said "OF COURSE!"

It reminds me of what happens to your brain when you launch a site. Even before you get feedback, somehow the knowledge that people can use it drastically changes your motivation system. Things that before seemed important are obviously not. Other things that were invisible before become the singular focus of your resources.

Maybe we should do fire drills...

* Your requests are suddenly taking 100x as long to complete. Go!

* Your "runway" disappears due to an accounting error and you have 7 days to turn a profit. Go!

* 50% of people visiting the site have no idea how to use it. Go!

How could we achieve the focus and clarity that a crisis brings on, without having the crisis?


The thing about performance tuning is you don't often know what your bottlenecks will be until the system is under load, what to patch until you dig deeper, and what the magnitude of improvement will actually be.


I understand the gist of your post, but I don't quite understand where you get 1 night from. From what I can tell the 1 night thing was to fix a bug, not improve performance.

Regarding the 5x increase:

> One of our operations engineers, Ricardo Chimal, Jr., has been working for some time on improving the way we target messages between components of our platform. We completed internal testing of these changes yesterday and they were deployed to our production cloud last night at 19:00 PDT (02:00 UTC).


Ah, you're right. I was reading too quickly. :) The general point stands, but yeah. Not a good example.


I've improved our system here likewise, and in short order. It comes down to working on the most pressing task.

I could spend all my time working on optimizations to the system, or I could work on things that will bring us more money. The company would rather the latter, I would rather the former.

But here's the key: Until there's a problem, you don't always notice the small things you could do to give a huge performance increase.

I also don't think mental drills are the answer. When I find a bottleneck, I don't just sit there and think. I'm out reviewing logs and watching the actual performance of the system. Without any details as to what's happening, how could I possibly find the real bottleneck?

Storytime: Once upon a time, we had a customer who abused our system. He submitted more data than all the rest of our customers combined. I loved him, because he showed us all kinds of bottlenecks. Management didn't love him because his stress on the system would frequently cause issues for other clients. Even better, the stress he put on the system really was just like regular clients, and not some abnormal stress. By the time they let him go, I had found so many bottlenecks that we didn't have another system issue for -years- after he was gone.



I'd like to point out the lesson that other industries can learn from IT infrastructure companies.

Heroku sells a technical product to a technical audience. They're foundational to their clients' products. So when something goes down, there's only one option: explain, in excruciating detail, exactly what happened, why it happened, and how it's going to be fixed in the future.

Why? Because their clients can smell bullshit better than a purebred bloodhound. Too much bullshit means it's time to move on.

Beyond being the right thing to do, being accountable is essential to trust. When you fuck up, it will piss people off. That's just life – everyone makes mistakes. So you need to be the guy where people can say "Okay, there was a fuck up, it was bad, but look at how hard these guys worked to fix it. Check out their plans to prevent it in the future."

Luckily, the incentives are aligned here to make this mostly non-negotiable. When you get medical malpractice, a financial meltdown or an oil spill going on, the cover-your-ass impulses are much more compelling.

Even in those cases though, I insist we need to encourage a culture where accountability and transparency are rewarded. Because, for me, accountable guys are the kind of people I want to do business with.

I dunno much about scaling a Rails server, but for now, at least, I know the Heroku guys are the sort of people I'd trust.


> there's only one option: explain, in excruciating detail, exactly what happened, why it happened, and how it's going to be fixed in the future. Why? Because their clients can smell bullshit better than a purebred bloodhound. Too much bullshit means it's time to move on.

Okay. I feel a bit sorry for bashing Heroku here, but I'll bite.

If I were a Heroku customer then I'd feel, ahem, a bit washed by their idea of "excruciating detail".

So their "internal messaging system" triggered a bug in their "distributed routing mesh". And they applied a "hot patch".

Great. As far as I am concerned, they might as well have written that their flux-compensator overheated because the pixie-dust exhaust got clogged with rogue bogomips.

I applaud their willingness to talk to their customers at all. But please... either explain what was going on in a meaningful way - or just leave it at "we screwed up and promise to do our best to prevent it from happening again".


> But please... either explain what was going on in a meaningful way

Some of us like a technical breakdown and feel warm fuzzy reassurance. If a few people got confused after the first paragraph, it's less harmful than appearing to bullshit technical users.


He's not saying it was too technical, he's saying it wasn't technical enough.


Yes, sorry if that was unclear.

In less snarky words: even Facebook told us quite clearly _how_ they screwed up the other day (the config management issue). In contrast, this Heroku article was disappointing.


Was that really excruciating detail? All I learned is that they had some bug in their messaging system and they screwed up while trying to fix it. Their postmortem pales in comparison to the recent facebook and foursquare postmortems.


I'm curious to know what else you would want to see there. I felt like it was a reasonable balance between writing something brief enough to be worth reading while still sharing the places where they themselves screwed up. It's pretty common practice, for example, to admit no wrongdoing whatsoever.


I don't know if I'd describe this postmortem as excruciating detail. I'd like to know more about what products they're using, how the "mesh" became overloaded, what their fix was, etc.

NASA reports have excruciating detail. This felt a bit vague.


Will someone at Heroku please describe your QA process?


I like this question and I hope more startups/companies explain how they maintain quality. There are far more stories about "I coded this in 24/48 hours/1week/3week" but fewer stories about "this is how we maintain high-quality code" or "this is our automation strategy".


I am not sure this is the right question to ask.

The problem is that you can build an elaborate maze of QA checks and still miss problems like these (and kill the company's ability to innovate in the process).

The reasons why you have (or don't have) various QA processes are much more interesting than the processes themselves.


Indeed. The underlying assumption that a "QA process" as traditionally conceived would have prevented this sort of problem is fairly faulty, I think.


I don't care for Heroku, but this is over the top: distributed systems are complicated, to build and especially to test. Even Google gets it wrong: http://gmailblog.blogspot.com/2009/09/more-on-todays-gmail-i... is not entirely dissimilar.


How is it over the top? I am genuinely curious about their QA process. I am not judging them.


I don't know much about Google's "process", but usually a company has a set of minimum rules about quality and code standards. Individual teams usually have more rules on top of that.

Why? Because everything depends on the type of problems people have to face.

So saying that "Google gets it wrong" as a whole company doesn't seem to fit well. The Gmail team didn't catch this one.

And I understand that testing is hard and certain things might not be testable/doable. But knowing how Google tries to minimize the impact by doing "something" is far more important.


Are they going to remove "rock-solid" from their front page copy?


Heroku had 45 minutes of downtime in August, 28 in September, and 45 in October so far. (Source: http://groups.google.com/group/heroku/browse_thread/thread/f...)

That's 99.90%, 99.94%, and 99.88% (for the month so far), or simply 99.91% for the entire period.
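Those figures follow directly from the minute counts; here's a quick sanity-check sketch (hypothetical Python, with calendar month lengths assumed and October counted through the 27th):

```python
# Availability from downtime minutes; downtime figures are from the
# comment above, month lengths are assumptions (October through Oct 27).

def availability(downtime_min, total_min):
    """Return availability as a percentage of total minutes."""
    return 100.0 * (1 - downtime_min / total_min)

months = {
    "August":    (45, 31 * 24 * 60),  # 44,640 minutes
    "September": (28, 30 * 24 * 60),  # 43,200 minutes
    "October":   (45, 27 * 24 * 60),  # 38,880 minutes so far
}

for name, (down, total) in months.items():
    print(f"{name}: {availability(down, total):.2f}%")

total_down = sum(d for d, _ in months.values())
total_mins = sum(t for _, t in months.values())
print(f"Overall: {availability(total_down, total_mins):.2f}%")
# August: 99.90%, September: 99.94%, October: 99.88%, Overall: 99.91%
```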

So, what would you consider "rock-solid"? Personally, I'll echo what was said in the other thread - 99.91% is much better than what I could accomplish on my own, so I'll continue to trust my business to them.


They had over an hour of downtime yesterday alone, one 8m outage and one 1h15m outage. (Source: http://status.heroku.com)

Regardless, 99.9% is really not special or even acceptable from a high dollar host like Heroku. Cheap shared hosts like HostGator can give you 99.9%.

I'm pulling for them, but they've got some work to do.


Agreed. I am happy with Heroku so far (I am pre-launch so the effect hasn't been great), and intend to stick with them for at least a while (I might have most of two decades as a dev in this business, but zero of those are as a sysadmin), but 45 minutes is low:

I may be showing 48^ errors in my per-minute checks, but those errors occurred in at least^ nine separate blocks across an eleven-hour window.

In other words, we are probably looking at around ten hours of issues, even if not outright outage.

^ I disabled the checks at times because it was just a waste, so the true stats are likely worse.


"cheap shared hosts" don't provide you with the same sort of infrastructure that heroku gives you.


Of course, my point was that 99.9% is expected on even the cheapest setups. Premium hosting like Heroku should at least be able to deliver comparable uptime to a cheap shared host like HostGator.

However, Heroku appears to have no SLA, so it's a moot point anyway.


I think it illustrates more that people don't realize what 99.9% actually is. Being down for an hour during a working day is a huge deal to most businesses, but they blithely think 99.9% is awesome availability.


The gold standard of availability is generally "four nines" - 99.99%. To me:

99.5% (or less) = unacceptable
99.9% = average
99.99% = exceptional

0.5% downtime = 216 mins/month (almost an hour a week!)
0.1% downtime = 43.2 mins/month
0.01% downtime = 4.32 mins/month
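Those downtime budgets are just the complement of the availability target; a minimal sketch (hypothetical Python, assuming a 30-day month):

```python
# Minutes of allowed downtime per 30-day month for each availability tier.
MONTH_MINUTES = 30 * 24 * 60  # 43,200

for target in (99.5, 99.9, 99.99):
    budget = MONTH_MINUTES * (100 - target) / 100
    print(f"{target}% uptime -> {budget:.2f} min/month of downtime")
# 99.5% -> 216.00, 99.9% -> 43.20, 99.99% -> 4.32
```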

Getting to that last level is certainly very hard, I hope Heroku will get there, but the service is relatively new, fairly cutting edge, under active development, etc. I would not entirely expect them to reach "four nines", but I hope it is their goal.


You got downvoted (probably for snarkiness) but I agree with your sentiment.

These Heroku outages are slowly turning into a running gag. I'm not even hosting anything on Heroku, yet this is the third time I've heard about a multi-hour outage this year. God knows how many outages I didn't hear about.

So, "rock-solid" does indeed sound a bit out of place here.


Does Heroku use anything like 5 Whys to incrementally address organizational-type causes?


> After isolating the bug, we attempted to roll back to a previous version of the routing mesh code. While the rollback solved the initial problem, there was an unexpected incompatibility between the routing mesh and our caching service.

To me, it seems like they just needed to apply the "Hot patch", instead they panicked(?) and did a lot of unnecessary version control gymnastics, which delayed the bug fix.


I've written mostly client code, and watched server action from the sidelines, but jumping straight to the hotfix only seems obvious in retrospect to me. Rolling back to a known-good state is the safe approach – it just didn't work in this case because of a surprise incompatibility with another system.

If you jump straight to the hotfix, you're basically enlisting the entirety of your userbase to join you in a round of QA, which could be sub-optimal if your hotfix ends up causing some other unintended consequence.

Right?


> Rolling back to a known-good state is the safe approach

Absolutely.

However, the rollback should be atomic, which means all pieces of the infrastructure/code should be rolled back to a known-good state.

When I said "gymnastics", I meant for rolling back one piece of code, only to find incompatibility with other pieces.

I do not intend to judge them; I know it's difficult not to panic in an emergency. But not working the effects out on paper (or not knowing which versions of the software components are inter-compatible) before actually touching the code looks pretty novice to me for a company of Heroku's scale.

I really hope this is not an unfair criticism. (Handling emergencies is difficult.)


The trouble is that they have messaging servers, a routing mesh, and caching servers, all loosely coupled and deployed on separate boxes. They could take down every piece of their infrastructure and roll them all back to where they were on some previous date, but this is not better, for several reasons:

1) It takes much longer than just rebooting the isolated service. Can you imagine Google shutting down every one of their millions of boxes, rolling them back to a previous state, and spinning them up again?

2) They'd still be at risk of incompatibilities with their databases, etc. The problem with unexpected incompatibilities is that they're unexpected.


All I am trying to say is that it only makes sense to analyze the effects of your changes before actually making them.

I fail to see why they needed to revert a piece of code, only to realize, OMG... this version of the code does not fit with the rest of the architecture, and then change it back.

1.) I do not expect Google (or even a small shop, like my place) to revert any piece of code that is not affected.

I, HOWEVER, expect to know EXACTLY what changes I am making, and what to EXPECT after the changes.

(It should not be black magic, for historical code.)

2.) I fail to understand why this should be the case for older code. I can understand some tricky/edge/minor cases, but whether the architecture/database etc. (major compatibility) is compatible or not should be possible to work out BEFORE making the changes.

I hope I am not over-trivializing the issue, but I still cannot get my head around the approach.


To me, it sounds like they rolled one software component back to a known working version.

That's not unreasonable. And doesn't sound like "a lot of unnecessary version control gymnastics" to me.

Of course, the problem was that the version they rolled back to was incompatible with the current version of their other services.


How is rolling back one software component, without knowing or calculating whether it will "play nice" with the rest of the architecture/services, acceptable or reasonable?

Am I missing something here? Please let me know.


Since what they had was already broken and customers were down, they probably didn't have the hours it would take to correctly verify the older version of the broken component would work with the rest of their system.


They ended up reverting and then undoing the revert because of non-verification. Also, paper calculations to be certain about the effects are not supposed to take a lot of time.


I think it's really cool that Heroku is so transparent about their outages. A lot of companies try to cover them up or blame them on someone else.

It's refreshing to see a company that not only acknowledges their outages, but even has a list of all past issues and outages. This transparency can only help them to become better in the future.



