Post Mortem: Downtime on April 23rd

At some point today, your timeline might have looked something like this: completely empty. That was our mistake. What happened only goes to show how delicate systems are and why quick recovery matters.

What our users normally don’t see is what our backend developers spend most of their time on: countless rewrites and data fixes. This is exactly what Artur was doing for the new custom colors feature we rolled out this morning. In the course of that work, he accidentally executed the wrong query, which had the side effect of deleting all the tasks. A human mistake.

The solution? Restoring the whole database from a backup taken right before the point of loss. Recovery time, from detecting the bug until we were up and running again, was 1 hour and 9 minutes.

During the recovery, we switched to another database to make sure our data was up to date, with the downside of losing an hour's worth of data.

To stop people from editing their workspaces, we had to take down the API. Why? Because any changes made during that period would have been lost anyway.

What did we learn? To double- and triple-check. And that everyone makes mistakes, which is OK as long as we manage to recover and learn.
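As an illustration only, and not the tooling involved in this incident, here is a minimal sketch of how part of that double-checking could be automated. It assumes Python's built-in sqlite3 module, a hypothetical `tasks` table, and a hypothetical helper named `run_guarded`: the write runs inside a transaction and is rolled back if it touches more rows than the developer expected.

```python
import sqlite3


def run_guarded(conn, sql, params=(), max_rows=1):
    """Run a write statement; roll back if it affects more rows than expected.

    `run_guarded` and `max_rows` are hypothetical names for this sketch.
    """
    cur = conn.cursor()
    try:
        # With sqlite3's default settings, a transaction is opened
        # implicitly before INSERT/UPDATE/DELETE statements.
        cur.execute(sql, params)
        if cur.rowcount > max_rows:
            raise RuntimeError(
                f"statement affected {cur.rowcount} rows, "
                f"expected at most {max_rows}"
            )
        conn.commit()
        return cur.rowcount
    except Exception:
        # Undo whatever the statement changed, then surface the error.
        conn.rollback()
        raise


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, color TEXT)")
    conn.executemany("INSERT INTO tasks (id, color) VALUES (?, NULL)",
                     [(1,), (2,), (3,)])
    conn.commit()

    try:
        # Meant to remove one row, but the condition matches every task.
        run_guarded(conn, "DELETE FROM tasks WHERE color IS NULL", max_rows=1)
    except RuntimeError as err:
        print("rolled back:", err)

    remaining = conn.execute("SELECT COUNT(*) FROM tasks").fetchone()[0]
    print("tasks remaining:", remaining)
```

A guard like this does not replace backups or review. It only narrows the window in which a single mistyped condition can wipe a whole table.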

We also came up with a solution that would let us communicate with our users even when our API is down.

