Starting Tuesday, July 29 at approximately 15:10 PDT, Strava experienced an outage that disrupted all uploads for almost 14 hours. For about 35 minutes, all uploads to Strava failed before we identified the root cause and put the site into maintenance mode while we evaluated the best course of action. We brought the site back into production at 16:10 PDT, but with all background job processing disabled. This allowed most interactive parts of the site to work. Uploads to Strava were captured, but were not processed until we were able to fix the underlying problem. The problem was fixed early Wednesday morning and all uploads were processed by 7:00 PDT.
When we process an upload, we identify and store multiple streams of data. Streams are time-indexed sequences that contain data for location, distance, speed, and so on – the core part of any activity recording. While the stream data itself is stored on Amazon's Simple Storage Service (S3), we do store some stream metadata – in particular, which streams are associated with a given activity – in a database table. Each row has an auto-incrementing identifier that was stored as a four-byte signed integer. Although the range of signed integers goes from -2147483648 to 2147483647, only the positive portion of that range is available to auto-incrementing keys. At 15:10, the upper limit was hit and insertions into the table started failing.
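For concreteness, here is a minimal sketch of the arithmetic involved, assuming a MySQL-style INT primary key. Only the numeric limits come from the incident; the function and its names are purely illustrative.

```python
# A minimal sketch of the keyspace limits involved, assuming a MySQL-style
# INT primary key. Only the numeric limits are taken from the incident;
# everything else here is illustrative.
SIGNED_INT_MAX = 2**31 - 1      # 2147483647: the last usable auto-increment value
UNSIGNED_INT_MAX = 2**32 - 1    # 4294967295: the ceiling after moving to INT UNSIGNED

def insert_allowed(next_id: int, ceiling: int = SIGNED_INT_MAX) -> bool:
    """Return True if the next auto-increment value still fits in the column."""
    return next_id <= ceiling

print(insert_allowed(2_147_483_647))                    # True: the final signed value
print(insert_allowed(2_147_483_648))                    # False: the insert that began failing at 15:10
print(insert_allowed(2_147_483_648, UNSIGNED_INT_MAX))  # True once the column is unsigned
```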
We discussed a variety of approaches for tackling the problem, ranging from keeping the site down entirely to emergency code changes to messing around with database internals. In the end, we felt that allowing uploads but temporarily turning off upload processing carried the least risk while keeping the site generally operational as we fixed the problem. Around the same time that we brought Strava back online, we started a database migration to change the type of the identifier from a signed integer to an unsigned integer. Unfortunately, we were not able to complete the migration in place, so we brought up a fresh database slave, performed the migration there, and then failed over to the new slave. The table held enough data that the migration took about 9 hours.
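In MySQL terms, the change itself is a single column modification. The sketch below is hypothetical – the real table name and tooling are not given here – and uses the PyMySQL driver to illustrate running the ALTER on a freshly provisioned replica before failing over to it.

```python
# Hypothetical sketch of the type change, assuming MySQL and the PyMySQL
# driver; "activity_streams" stands in for the real table name, which is
# not given in this post. Running the ALTER on a replica and then failing
# over avoids blocking the primary for the duration of the table rebuild.
import pymysql

ALTER_SQL = """
ALTER TABLE activity_streams
  MODIFY id INT UNSIGNED NOT NULL AUTO_INCREMENT
"""

def widen_primary_key(host: str) -> None:
    conn = pymysql.connect(host=host, user="admin", password="...", database="strava")
    try:
        with conn.cursor() as cur:
            cur.execute(ALTER_SQL)  # rewrites the whole table; hours for a large one
        conn.commit()
    finally:
        conn.close()

# Run against the new replica, verify the data, then promote it to primary:
# widen_primary_key("replica.db.internal")
```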
What We’re Doing
Normally we would have weeks of advance notice for this type of issue – we have in the past run many migrations to adjust data types or make other schema changes – but we did not have keyspace monitoring in place for this particular database. We have already conducted an audit of all primary keys and will be preemptively running schema changes to convert all remaining signed-integer primary keys to unsigned integers. We have also put additional monitoring in place and will be reviewing keyspaces as part of our ongoing production processes.
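As an illustration of what such monitoring can look like, here is a hedged sketch that queries MySQL's information_schema for every auto-increment counter and flags keys that are most of the way to their ceiling. The driver, schema name, and threshold are assumptions for the example, not a description of our actual tooling.

```python
# A sketch of keyspace monitoring, assuming MySQL and the PyMySQL driver.
# It compares each table's current AUTO_INCREMENT counter to the ceiling
# of its primary-key column type and flags anything above a threshold.
import pymysql

QUERY = """
SELECT t.TABLE_NAME, t.AUTO_INCREMENT, c.COLUMN_TYPE
FROM information_schema.TABLES t
JOIN information_schema.COLUMNS c
  ON c.TABLE_SCHEMA = t.TABLE_SCHEMA AND c.TABLE_NAME = t.TABLE_NAME
WHERE t.TABLE_SCHEMA = %s
  AND c.EXTRA LIKE '%%auto_increment%%'
  AND t.AUTO_INCREMENT IS NOT NULL
"""

def ceiling_for(column_type: str) -> int:
    """Largest value an auto-increment key of this MySQL column type can hold."""
    base = column_type.split("(")[0].split()[0]   # "int(11) unsigned" -> "int"
    bits = {"tinyint": 8, "smallint": 16, "mediumint": 24, "int": 32, "bigint": 64}[base]
    return 2**bits - 1 if "unsigned" in column_type else 2**(bits - 1) - 1

def keys_near_exhaustion(conn, schema: str, threshold: float = 0.75):
    """Yield (table, fraction_used) for auto-increment keys above the threshold."""
    with conn.cursor() as cur:
        cur.execute(QUERY, (schema,))
        for table, next_id, column_type in cur.fetchall():
            used = next_id / ceiling_for(column_type)
            if used >= threshold:
                yield table, used

# Example: alert on any key more than 75% of the way to its ceiling.
# conn = pymysql.connect(host="db.internal", user="monitor", password="...", database="strava")
# for table, used in keys_near_exhaustion(conn, "strava"):
#     print(f"{table}: {used:.0%} of keyspace consumed")
```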
We will also be reviewing our ability to monitor and effectively control failing uploads. Today, the only mechanism we have for disabling uploads is overly broad. We need to be able to instantly disable either uploads or upload processing and to provide appropriate messaging in our applications.
We recognize that we did not communicate with you, our users, nearly as effectively as we could have. We know that transparency and prompt updates are important, so we are starting to work on ways to improve our communications. This includes streamlining our internal communications and making it easy to post notifications and updates to various locations, including strava.com, our mobile applications, and our Twitter account. We are also putting together a site that will include up-to-the-minute information on the availability of Strava.
We are athletes and users of Strava ourselves. It is an important part of our personal lives as well as our professional ones, so any problem with Strava frustrates us as much as it does you. We are continually working to ensure the highest level of availability and reliability. However, there are times when things either fall through the cracks or affect us in ways we had not anticipated. Each of these incidents gives us the chance to critically re-examine ourselves; to improve our infrastructure, practices, and processes; and to come back stronger than we were before.