One of the first scaling challenges we encountered at Strava was maintaining real-time segment leaderboards. There are a number of questions that we ask of a segment leaderboard, such as:
- Who’s in the top 10?
- Who’s ranked 150 through 175?
- What is my rank?
- Who is ranked near me?
Ideally, all of these questions can be answered efficiently, i.e. in better than O(N) time, where N is the size of the segment leaderboard (which is also the number of unique athletes with efforts on the segment). As the total number of efforts across all segments climbed past 100 million, we began to see performance issues:
- Some leaderboards were becoming difficult to cache. On a busy day the contents of popular leaderboards were shifting frequently with new riders and new PRs, forcing us to rebuild them often. Well-travelled segments were also often the most viewed. (NOTE: PR = personal record, or an athlete’s best time on a segment. Only PRs appear on a leaderboard.)
- Our queries were slowing down as leaderboards grew. Ranked queries in SQL are notoriously convoluted. These queries began bubbling up in our slow query logs for popular segments, which in some cases now had over 50k efforts and over 3k unique athletes.
- Our database cache was thrashing. We couldn’t keep in memory all the indices used by the leaderboard queries. As our indices hit I/O things really started to get bad.
To give you a sense of things, a table holding all efforts on all segments, across all athletes and activities (including the leaderboard-worthy efforts, i.e. PRs), would have these columns:
segment_efforts(id, segment_id, activity_id, athlete_id, elapsed_time)
…and have a compound index on [segment_id, athlete_id, elapsed_time]. There are a number of different ways to write a query to generate a leaderboard for a particular segment. None of them are awesome – the way below is as good as any:
SELECT MIN(e1.id) id, e1.athlete_id, e1.elapsed_time
FROM segment_efforts e1
INNER JOIN (
  SELECT athlete_id, MIN(elapsed_time) fastest_time
  FROM segment_efforts
  WHERE segment_id = ?
  GROUP BY athlete_id
) e2 ON e1.athlete_id = e2.athlete_id AND e1.elapsed_time = e2.fastest_time
WHERE e1.segment_id = ?
GROUP BY e1.athlete_id, e1.elapsed_time
ORDER BY e1.elapsed_time, e1.id
Lovely, right? As a side note: you’ve officially been using Rails too long if seeing pluralized table names (e.g. segment_efforts vs. segment_effort) no longer registers some disgust.
To support our growth and improve performance, we needed to denormalize the leaderboard data (or find a bigger box or shard, but I’m going to skip those discussions). The first obvious place to turn was back to our database:
- We could add a sparsely populated pr column to our table, along with an accompanying index ([segment_id, pr, elapsed_time]). We’d then update this column as necessary when new efforts are inserted. This would involve updating the table (and index) much more frequently than before. The table was already our hottest table for writes, so this wasn’t attractive.
- Alternatively, we could manage a separate leaderboard table, which would only store PRs, with a unique index on [segment_id, athlete_id]. Aside from creating another write-heavy table, some queries like “What’s my rank?” would remain inefficient. Keeping a rank column up-to-date was a non-starter, as there would be far too many updates.
These solutions came up short. Here’s where Redis fit in. We were already using Redis for our background jobs, and its Sorted Set data structure mapped very nicely to our problem. Each segment would have its own sorted set, with the AthleteID as the member and the athlete’s fastest elapsed time as the score (see zadd). All of our queries (“Who’s in the top 10?”, etc.) then run in better than linear time; sorted sets are backed by a skip list, so rank and range operations are O(log N).
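The mapping is easy to picture with a small in-memory stand-in for one segment’s sorted set. This is just a sketch to illustrate the semantics we rely on (the class and method names below are made up; real code issues ZADD/ZRANGE/ZRANK against Redis):

```ruby
# In-memory stand-in for one segment's sorted set, illustrating the
# Redis semantics the leaderboard depends on.
class SegmentLeaderboard
  def initialize
    @scores = {} # athlete_id => fastest elapsed_time (the sorted set's "score")
  end

  # Mirrors ZADD, but only keeps the new time if it beats the athlete's PR
  # (Redis 6.2+ can even do this server-side with ZADD ... LT).
  def record_effort(athlete_id, elapsed_time)
    current = @scores[athlete_id]
    @scores[athlete_id] = elapsed_time if current.nil? || elapsed_time < current
  end

  # Mirrors ZRANGE key start stop: members ordered by ascending score.
  def range(start, stop)
    @scores.sort_by { |athlete_id, time| [time, athlete_id] }[start..stop].map(&:first)
  end

  # Mirrors ZRANK: 0-based position (O(log N) in Redis' skip list).
  def rank(athlete_id)
    range(0, -1).index(athlete_id)
  end
end

board = SegmentLeaderboard.new
board.record_effort(101, 320)
board.record_effort(102, 295)
board.record_effort(101, 280) # a new PR for athlete 101
board.record_effort(103, 340)

board.range(0, 1) # => [101, 102] -- the top two
board.rank(103)   # => 2
```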
A sorted set can only hold two values per entry (athlete_id, elapsed_time), so we still need a place to store the ID of the actual effort to which these values map. We could go back to the database to look these up (recall our index on [segment_id, athlete_id, elapsed_time]), but instead we decided to store this information in a complementary data structure in Redis, using a hash for each segment, keyed by AthleteID. For example, retrieving the top 10 leaderboard for a segment would look something like:
athlete_ids = zrange(<sorted_set_key>, 0, 9)
effort_ids = hmget(<efforts_hash_key>, *athlete_ids)
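Fleshed out, a top-N read joins the two structures in one helper. This is a hedged sketch assuming any client object that responds to `zrange` and `hmget` (redis-rb has exactly these methods); the key names are illustrative, not our actual scheme:

```ruby
# Top-N leaderboard read: rank order comes from the sorted set,
# effort IDs come from the companion hash.
def top_efforts(redis, segment_id, n)
  athlete_ids = redis.zrange("segment:#{segment_id}:leaderboard", 0, n - 1)
  return [] if athlete_ids.empty?

  # hmget preserves argument order, so results line up with athlete_ids.
  effort_ids = redis.hmget("segment:#{segment_id}:efforts", *athlete_ids)
  athlete_ids.zip(effort_ids) # => [[athlete_id, effort_id], ...]
end
```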
Using Redis did introduce some new challenges. Keeping the efforts hash up-to-date and in sync with the leaderboard sorted set required some extra bookkeeping. In theory you’d want these updated in the same transaction, but Redis doesn’t quite have the equivalent database concept. Instead, we could do these updates while holding some sort of exclusive lock (per segment), or we could use Redis’s WATCH command and retry on failure. We chose the latter.
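The retry-on-failure approach is plain optimistic concurrency. Below is a pure-Ruby sketch of the pattern; the store and method names are invented for illustration. With redis-rb you would instead WATCH the two keys, wrap the zadd and hset in a MULTI block, and retry when the transaction aborts because a watched key changed:

```ruby
# Toy store with a version counter, standing in for WATCH/MULTI/EXEC:
# read, compute, then commit only if nothing changed underneath us.
class OptimisticStore
  def initialize
    @data = {}
    @version = 0
  end

  def read(key)
    [@data[key], @version]
  end

  # Succeeds only if no other write landed since `seen_version` --
  # the moral equivalent of EXEC aborting after a watched key changed.
  def commit(seen_version)
    return false unless @version == seen_version
    yield @data
    @version += 1
    true
  end
end

def update_pr(store, athlete_id, elapsed_time, effort_id, max_retries: 5)
  max_retries.times do
    current, version = store.read(athlete_id)
    return false if current && current[:time] <= elapsed_time # not a PR

    ok = store.commit(version) do |data|
      # In Redis this would be the MULTI block: zadd on the sorted set
      # plus hset on the efforts hash, applied atomically.
      data[athlete_id] = { time: elapsed_time, effort_id: effort_id }
    end
    return true if ok # otherwise someone raced us; re-read and retry
  end
  false # contention exhausted our retries
end
```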
Finally, we have the issue of leaderboard data now living in two separate stores: our database and Redis. In theory we’d like to wrap related updates to both of these in a coordinated transaction, but we’re not launching missiles into space at Strava… So, we must assume there will be drift. Our database is still the “system of record”, so as we discover dirty data in Redis, we lazily correct it. We also make the Redis updates dependent on the database transaction committing, so at least we can reduce phantom data in Redis (entries for efforts that were never actually committed to the database).
When storing data in Redis we found it was important to adopt sane key namespaces. If you just throw non-expiring keys into it willy-nilly, you’ll have a hard time cleaning them up. Everything must fit into memory with Redis, so you’ll want to be judicious about how and what you store in it. There are wildcard queries you can use to find keys, but you wouldn’t want to trawl your entire key space with them. In general, a nice model for us has been to think of our data-sets in Redis as indices.
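Concretely, routing every key through one helper keeps the namespace coherent, so a segment’s data can be enumerated, migrated, or deleted without scanning the whole key space. The scheme below is illustrative, not our actual one:

```ruby
# One module owns every leaderboard key name; nothing else builds
# key strings by hand.
module LeaderboardKeys
  PREFIX = "leaderboards".freeze

  def self.sorted_set(segment_id)
    "#{PREFIX}:segment:#{segment_id}:ranks"
  end

  def self.efforts_hash(segment_id)
    "#{PREFIX}:segment:#{segment_id}:efforts"
  end
end

LeaderboardKeys.sorted_set(42)   # => "leaderboards:segment:42:ranks"
LeaderboardKeys.efforts_hash(42) # => "leaderboards:segment:42:efforts"
```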
Did I mention yet that we love Redis? From a systems POV it has been awesome: replication is easy, memory footprint is small (we do optimize for this), and the CPU is largely bored. Today all of our leaderboards fit on a single replicated Redis box using under 16GB of memory. As our data-set grows we feel comfortable sharding it as our access patterns are already so straightforward (thankfully Redis forces you to keep things simple).
Since our initial transition to Redis we’ve added filtered leaderboards (by weight class and age group) and date-range windows (this year, this month, today). These features introduced some additional layers of complexity, but in general we see this design scaling well in the future.
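One way to picture how those filters extend the design: each (segment, filter) combination gets its own sorted set, and a single effort fans out to every set it qualifies for. This is a hypothetical sketch; the filter names and buckets below are invented, not our actual scheme:

```ruby
require "date"

# Given an effort, list every leaderboard sorted-set key it should be
# written to: the overall board plus any filtered and date-windowed boards.
def leaderboard_keys_for(segment_id, athlete)
  keys = ["segment:#{segment_id}:overall"]
  keys << "segment:#{segment_id}:age:#{athlete[:age_group]}" if athlete[:age_group]
  keys << "segment:#{segment_id}:weight:#{athlete[:weight_class]}" if athlete[:weight_class]
  date = athlete[:effort_date]
  keys << "segment:#{segment_id}:year:#{date.year}" if date
  keys
end
```

A write then loops over these keys issuing the same conditional zadd against each, so the filtered boards stay cheap reads at the cost of a bounded write fan-out.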