The fitness/freshness chart, first released in March 2013, is one of the many powerful training tools available to Strava premium members. Cyclists can use this feature to monitor their fitness and fatigue levels. According to the underlying model, both fitness and fatigue increase with each ride, by an amount proportional to the intensity of the ride. Between rides, both quantities decay exponentially.
The intensity of each ride is measured using Training Load, calculated from weighted average power. Thus, power data is necessary to generate the fitness/freshness chart. Also, for the model to give comprehensive and realistic estimates of fitness and freshness, all of an athlete’s rides need to have a Training Load.
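As a rough illustration of the model described above, here is a minimal Ruby sketch. The time constants (42 and 7 days), the normalized update, and the definition of freshness as fitness minus fatigue are assumptions borrowed from common impulse-response training models, not values taken from Strava's implementation.

```ruby
# Assumed time constants, in days. Fitness decays slowly, fatigue quickly.
FITNESS_DAYS = 42.0
FATIGUE_DAYS = 7.0

# One day of exponential decay plus a bump proportional to the ride's load.
def decayed_update(previous, ride_load, time_constant_days)
  decay = Math.exp(-1.0 / time_constant_days)
  previous * decay + ride_load * (1.0 - decay)
end

# daily_loads: one Training Load per day (0 for a rest day).
# Returns one { fitness:, fatigue:, freshness: } hash per day.
def fitness_freshness(daily_loads)
  fitness = 0.0
  fatigue = 0.0
  daily_loads.map do |ride_load|
    fitness = decayed_update(fitness, ride_load, FITNESS_DAYS)
    fatigue = decayed_update(fatigue, ride_load, FATIGUE_DAYS)
    # Assumption: freshness is the gap between fitness and fatigue.
    { fitness: fitness, fatigue: fatigue, freshness: fitness - fatigue }
  end
end

fitness_freshness([100, 0, 80, 0, 0, 120, 0]).each_with_index do |day, i|
  puts format("day %d: fitness %.1f, fatigue %.1f, freshness %.1f",
              i + 1, day[:fitness], day[:fatigue], day[:freshness])
end
```

Every update in this sketch is linear in the load, so multiplying all loads by a constant multiplies the resulting curves by the same constant without changing their shape.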
Unfortunately, not many cyclists have power meters. Even among those who do, few use their power meters for every single ride (for instance, they might have a power meter on their road bike but not their mountain bike). In order to make this powerful tool work for cyclists without power meters, and to fill in data gaps for cyclists with power meters, we need a way to estimate the intensity of a ride which doesn’t depend on power data. Fortunately, Strava has such a metric: Suffer Score!
Suffer Score and Training Load both represent the “intensity” of a ride. However, the two quantities are calculated in different ways, using different input data. Suffer Score is a weighted sum of the time spent in each (athlete-specific) heart rate zone, while Training Load is calculated from the weighted average power of an entire ride. So while we expect a ride with a higher Suffer Score to also have a higher Training Load, it isn’t obvious that Suffer Score is a viable substitute for Training Load in the fitness/freshness chart.
To better understand the relationship between Suffer Score and Training Load, we can look at Suffer Score and Training Load data for individual athletes. The figures below show the relationship between Suffer Score and Training Load for all the uploaded rides for three prolific Strava cyclists. Lines of best fit are superimposed in black.
These plots show a good correlation between Suffer Score and Training Load for each individual athlete (for a larger set of cyclists, R^2 values varied between 0.75 and 0.9). However, it’s also clear that the relationship varies from person to person — for example, the slope of the best fit line varies between about 0.9 and 1.9.
Based on these charts, we draw two conclusions. The first is that in general, Suffer Score is roughly proportional to Training Load. Because of the formulas used in the fitness/freshness model, this means that if we simply plug Suffer Score into the model where Training Load is usually used, we will get curves with the same shape as the original model (with an overall scaling factor). As a consequence, the Suffer Score-only model will reflect processes such as tapering and peaking in the same way that the Training Load model does.
In addition, determining the best-fit line relating Suffer Score and Training Load for a particular athlete lets us predict one using the other, for an individual ride. So if an athlete who regularly uses both a heart rate monitor and a power meter goes on a ride without a power meter, she can estimate the Training Load of her ride from its Suffer Score.
There is certainly considerable uncertainty that comes with using Suffer Score to estimate Training Load. However, the error in the fitness/freshness model caused by completely omitting an activity dwarfs the error caused by using an estimated value.
The fitness/freshness chart on Strava has been updated based on these ideas. For premium members with no power data available, we generate the chart using only Suffer Score. For those who have at least 10 rides with both Suffer Score and Training Load, the athlete-specific best-fit between the two is calculated and used to fill in Training Load for rides with heart rate but not power.
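The fill-in step can be sketched with an ordinary least-squares fit. This is only an illustration: the hash keys, the `estimated` flag, and the choice to leave rides untouched below the 10-ride threshold are assumptions, and the post does not say which fitting method is actually used.

```ruby
MIN_PAIRED_RIDES = 10 # threshold stated in the post

# Least-squares fit of training_load = slope * suffer_score + intercept.
# pairs: array of [suffer_score, training_load].
# Note: assumes the Suffer Scores are not all identical.
def best_fit(pairs)
  n = pairs.size.to_f
  mean_x = pairs.sum { |x, _| x } / n
  mean_y = pairs.sum { |_, y| y } / n
  sxx = pairs.sum { |x, _| (x - mean_x)**2 }
  sxy = pairs.sum { |x, y| (x - mean_x) * (y - mean_y) }
  slope = sxy / sxx
  { slope: slope, intercept: mean_y - slope * mean_x }
end

# rides: array of hashes with optional :suffer_score and :training_load.
# Rides with heart rate but no power get an estimated Training Load.
def fill_in_training_load(rides)
  paired = rides.select { |r| r[:suffer_score] && r[:training_load] }
  return rides if paired.size < MIN_PAIRED_RIDES

  fit = best_fit(paired.map { |r| [r[:suffer_score], r[:training_load]] })
  rides.map do |r|
    next r if r[:training_load] || r[:suffer_score].nil?

    estimate = fit[:slope] * r[:suffer_score] + fit[:intercept]
    r.merge(training_load: estimate, estimated: true)
  end
end
```

Keeping an `estimated` marker on filled-in rides makes it possible to tell measured and predicted values apart later, which matters given the uncertainty discussed above.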
At Strava, we strive to produce quality software to serve and motivate the world’s athletes. As part of that mission, we are constantly updating and refining our user experience. And few things make for a worse user experience than the app crashing.
We’ve all experienced a crash at some point – the app suddenly disappears and is replaced by the system error dialog. This is jarring and interrupts your flow: it forces you to stop mid-thought, navigate back to where you were in the app, and finally restart what you were originally doing. Additionally, if you entered the app from a transient call-to-action (e.g. a push notification), it can be difficult or even impossible to find your place again. As developers, we try to avoid crashes in the first place, and to quickly find & fix those that slip through.
For a long time, the Android team at Strava has been tracking crash statistics via the Google Play Store. When Strava crashes, the phone displays a prompt to either report or ignore the crash; reporting it sends the stack trace and other non-identifying device information to Google, where we can examine it. Our crash counts looked reasonable – a few hundred per day across 200,000 daily sessions – but we suspected this wasn’t the whole story. To get a clearer picture, and collect data for crashes that weren’t being reported, we integrated Crashlytics – a crash reporting service owned by Twitter. We were in for a shock.
Our first major release with Crashlytics was Strava 4.1.6, which included updates to Training Videos and assorted bug fixes. We released the morning of Thursday 24 July, and kept an eye on both Google Play and Crashlytics to make sure we hadn’t introduced any unexpected show-stopping bugs. There weren’t any of the latter, but we were unpleasantly surprised to see nearly 8,000 crashes reported in the first 3 days. The crash rate only accelerated from there, putting us over 42,000 crashes after a week. After the initial ramp-up, as users upgraded, we averaged around 7,000 crashes per weekday and 8,500 on the weekends – an unacceptably high count for a company as focused on quality as Strava.
Crashlytics reported that 95-96% of the athletes using Strava on any given day were crash-free. In other words, the other 4-5% would experience a crash that day. App instability annoys users, generates poor reviews and increases churn rate.
We had never seen many of the crashes that our athletes were experiencing, and they often did not appear on the top pages of the Play Store crash list. For instance:
- The most prevalent crash, with 20,000 occurrences in the first week, was an IllegalStateException in the Android Text-to-Speech (TTS) system: the platform was incorrectly reporting successful binding to the TTS service. We’d never seen it before (nor have we reproduced it since) but could eliminate close to 50% of our crashes by catching and logging the exception.
- A further 3,500 crashes were due to fragment state loss in a commonly-viewed Activity.
- A number of other smaller crashes were due to not checking that a Fragment was attached to an Activity in an asynchronous Receiver before using the result of Fragment#getActivity().
Plugging the Leaks
On Thursday 14 August, three weeks after releasing Strava 4.1.6, we quietly unveiled Strava 4.1.7, a bug fix release targeting the worst crashes of Strava 4.1.6. A week later, our numbers were a lot more encouraging:
- 55% fewer crashes than 4.1.6: 4,500 in the first 3 days and 18,000 after a week.
- Averaged 3,000 crashes each weekday (after an initial ramp-up) and 6,000 each weekend day, about 3,000 fewer than 4.1.6.
- 1-2% of athletes experienced a crash on any given day.
It’s difficult to overstate the importance of this improvement of 3-4 percentage points. It represents 6,000-8,000 fewer sessions that end in a crash: 6,000-8,000 more daily opportunities for us to impress our athletes instead of disappointing or alienating them.
The most common remaining crashes in Strava 4.1.7 were overwhelmingly OutOfMemoryErrors (OOMs). The three most common crashes, totaling 8,000 of the 18,000, were OOMs. There was still a lot of work left to be done, but now we were facing the same issues that we knew every other Android developer struggled with: a system with limited resources and aggressive memory management.
Small Fix, Big Impact
Since releasing a redesigned feed in Strava 4.0 this past March, our app has had a larger memory footprint. We’ve tried reducing the size of our remote image cache, but this did not result in any significant reduction in OOMs. We’ve made a number of other small tweaks: lazy-loading drawable resources, removing duplicate drawable loads, more aggressively recycling bitmaps, reducing common object instantiations; but OOMs kept occurring.
We observed that 25-30% of OOMs came from Google Play Services Maps. This is a known issue, and particularly affects newer devices. Since we don’t have control over Play Services image loading and caching, we needed to take a different approach.
To address our OOM problem from another angle, we set android:largeHeap="true" on the application element of our manifest. This causes Android devices running API 11 (Honeycomb) and above to increase the maximum heap size when the application is launched. This performed well in internal testing, so we rolled out to 5% of our install base to see how well it would work in production. The signs were good, so we added it to our latest release, Strava 4.2.0.
On Tuesday 30 September, we released Strava 4.2.0, which included a redesign of the profile screen, weekly progress & goals, run and ride auto-pause, a home screen widget, and a new ongoing recording notification, as well as largeHeap=true. Despite many exciting changes, this release has been our most stable yet, with only 3,200 crashes in the first 3 days and 12,000 a week later (around 1,500/day) – down 71% from 4.1.6.
Today, over 99% of athletes using Strava on a given day do not experience a crash. The most common crash (accounting for 2,200 of the 12,000) is due to a known issue in a beta release of Android L. Our most prevalent OOM, at just over 1,000 occurrences, is now only the third most common crash. Among OOMs, totaling roughly 2,600 of the 12,000, 99% of occurrences are on legacy devices that do not support largeHeap. On the most common devices, we see almost 85% fewer crashes than 3 months ago.
While we’re proud of getting our crash count down, we still have a lot of work left to bring crash counts down even further. While some crashes are obscure and unavoidable, many more can still be prevented. It’s great that 99% of our athletes go crash-free on any given day, but we’d prefer to see 99.9% or even better than that. Crashlytics automatically prioritizes crashes by the number of occurrences; we’ll continue to pick off the worst offenders to make using Strava a more stable and enjoyable experience.
We’ve learned that the Play crash reporting system is limited. We missed tens of thousands of crashes a week, a difference that can make or break an app’s user experience. The data provided by Crashlytics has completely changed our perspective of our app’s stability, and given us new targets to work for.
The Android documentation discourages the use of largeHeap, stating that it is better to fix the underlying memory problems rather than rely on the system to provide more memory. In particular, the documentation notes that largeHeap does not guarantee the system will allocate more memory: it is considered a hint that may be ignored. Our case is exceptional, though, because many of our OOMs are due to image upscaling performed by Google Play Services Maps. We rely heavily on maps and cannot control the memory usage of third-party code. Many more OOMs not directly caused by maps are often influenced by them – for instance, there may be less available heap space due to tile image caching. In addition, using largeHeap did not degrade app performance. More memory requires more expensive garbage collection, which can lead to dropped frames and a less-responsive user experience. In our testing, however, we did not see any more dropped frames than before, and the user interface was more responsive than ever.
Internationalization is the kind of problem whose solution is comparable to replacing the wheels of a moving train – it’s not acceptable to slow down the product development cycle while it is happening. By nature, internationalization is not a feature of a product: it’s a process that becomes part of how your company operates. Marketing, product, support, design, business development, engineering: every single person and team was involved in the effort. Even our database admins recently had to update the schema of some tables to support the full range of the UTF-8 charset.
So, as an engineer, where do you start? By breaking things. Or rather: pointing out things that are unknowingly broken. Pseudolocalization helps show which parts of your UI are not localization-friendly. A few years ago, Google developed and open-sourced a pseudolocalization library which implements a variety of schemes, each identifying different shortcomings of your UI layer. In the past year, we have maintained a fork of this library, Maven-ized it and added support for more file formats: YAML, Android XML and Apple strings.
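To make the idea concrete, here is a small Ruby sketch of one possible pseudolocalization scheme: accent the vowels, pad the string, and wrap it in brackets. It does not use or mirror the API of the Google library mentioned above; the character map, the 30% expansion factor, and the `pseudolocalize` name are all illustrative assumptions.

```ruby
# Hypothetical accent map: text stays readable but is visibly "translated".
ACCENTED = {
  "a" => "á", "e" => "é", "i" => "í", "o" => "ö", "u" => "ü",
  "A" => "Å", "E" => "É", "I" => "Î", "O" => "Ö", "U" => "Û"
}.freeze

# Rails-style placeholders such as %{name} must survive untouched.
PLACEHOLDER = /%\{\w+\}/

def pseudolocalize(message, expansion: 0.3)
  # Split into placeholders and plain text, and accent only the plain text.
  accented = message.split(/(#{PLACEHOLDER})/).map do |part|
    part.match?(/\A#{PLACEHOLDER}\z/) ? part : part.gsub(/[aeiouAEIOU]/, ACCENTED)
  end.join
  # Padding simulates languages that need more room than English.
  padding = "~" * (message.length * expansion).ceil
  # Brackets reveal truncated or concatenated strings in the UI.
  "[#{accented}#{padding}]"
end

puts pseudolocalize("Welcome back, %{name}!")
```

Strings that show up in the UI without accents are hardcoded rather than localized, and strings whose closing bracket is cut off sit in a layout that cannot absorb longer translations.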
How to give your designers a heart attack
A localization pipeline can be tedious to maintain, which is why most of it should be automated. In our case, the only human intervention in the pipeline is the translation (which happens outside our walls) and the QA. Strava maintains all of its UI translations in a separate git repository, updated every few hours with the latest work of our translators. Each product (Android / iOS / web) is then responsible for pulling the translations into their respective repositories on their own schedule. This normally happens automatically, before the nightly versions are built. New messages are pushed to our translation vendor every day at the end of the work day. Engineers have access to a dashboard to know when the translations are back and their feature is ready to ship.
Strava is running a Rails frontend and we’ve progressively tuned the configuration of our stack to meet our needs: translation fallbacks, plurals, etc… Before our first international release, we wanted to remain careful and have a chance to enable a localized experience progressively across our website. We wrote a filter which is in charge of detecting the locale and that can be included on a per-controller basis.
    module I18n
      module LocaleAware
        LANGUAGE_PARAMETER = :sikrit

        # To be included in most web controllers
        module Web
          SUPPORTED_WEB_LANGUAGES = [
            # This is where we list our supported languages
          ].freeze

          # Register the locale filter on any controller that includes this module
          def self.included(base)
            if base.respond_to? :before_filter
              base.before_filter :set_web_language
            end
          end

          # URL parameter first, then cookie, then Accept-Language header, then default
          def set_web_language
            detected_language =
              ::I18n::LocaleAware.language_from_param(params[LANGUAGE_PARAMETER]) ||
              ::I18n::LocaleAware.language_from_cookie(cookies['ui_language']) ||
              ::I18n::LocaleAware.language_from_header(request.headers['Accept-Language']) ||
              ::I18n::LanguageCodes.default_language

            if SUPPORTED_WEB_LANGUAGES.include?(detected_language)
              I18n.locale = detected_language
            else
              I18n.locale = ::I18n::LanguageCodes.default_language
            end
          end
        end
      end
    end

    class AthletesController
      include I18n::LocaleAware::Web

      def index
        # Make use of I18n.locale
      end
    end
We first look at the Accept-Language header to pick which language to show. It’s not perfect but it gets us 90% of the way, and it doesn’t require user intervention. In case we get it wrong, athletes may override that using the language picker at the bottom of every page on Strava. To allow developers to quickly switch languages, we also support a URL parameter which overrides any other input. We ended up writing several filters of this sort, in order to, e.g., release beta languages to a specific set of athletes before opening them up to the rest of the world.
TwitterCldr JS is included in pretty much every page on Strava. We rolled out our own solution for managing JS translations. For simplicity’s sake, we wanted to keep YAML as a storage format and keep Rails’ message format for placeholders and plurals. We configured our asset pipeline to marshal the YAML file containing messages used in our JS code into JSON, and we wrote a small CoffeeScript library that mimics Rails’ built-in
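For readers curious about the header step, the following Ruby sketch shows one way to turn an Accept-Language value into a supported language. The method names and the fallback from a regional tag to its primary subtag are assumptions for illustration; this is not the code behind the filter above.

```ruby
# Parse an Accept-Language header into language tags, most preferred first.
# Entries look like "fr-CH, fr;q=0.9, en;q=0.8"; a missing q-value means 1.0.
def languages_from_header(header)
  return [] if header.nil? || header.strip.empty?

  entries = header.split(",").each_with_index.map do |entry, index|
    tag, *params = entry.strip.split(";")
    q_param = params.map(&:strip).find { |p| p.start_with?("q=") }
    quality = q_param ? q_param.delete_prefix("q=").to_f : 1.0
    [tag.to_s.strip.downcase, quality, index]
  end

  entries
    .reject { |tag, quality, _| tag.empty? || tag == "*" || quality <= 0 }
    .sort_by { |_, quality, index| [-quality, index] } # header order breaks ties
    .map(&:first)
end

# Return the first supported language, trying "fr" when "fr-ch" is not listed.
def pick_language(header, supported, default)
  languages_from_header(header).each do |tag|
    return tag if supported.include?(tag)

    primary = tag.split("-").first
    return primary if supported.include?(primary)
  end
  default
end

puts pick_language("fr-CH, fr;q=0.9, en;q=0.8", %w[en fr], "en")
```

Falling back to a default when nothing matches keeps the page renderable even for headers that list only unsupported languages.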
Many more challenges needed to be tackled and some of them remain to this day: localizing content stored in our database (e.g. Strava Challenges), letting athletes submit support questions in their native language, localizing assets, supporting international payments…
In the past year, Strava has shipped in 13 languages on both mobile platforms and we’re launching our 10th language on the web today:
In numbers, internationalization took us from being natively available to a mere 400 million people to over 1.4 billion. If you are a Strava athlete using our services in English and haven’t taken notice of this, then it means this has been a success. Even though Strava is a relatively young organization, legacy code starts creeping in from day 1. Internationalization made us take a hard look at the parts of our codebase that rarely see the light of day and question past decisions – if no one can reasonably explain why things are done in a given way, it’s probably a good idea to try getting rid of it.
It took 6 months to ship the web app in French, our first non-English language. Each subsequent international launch took less and less time, down to just an hour to add support for our newest language. This is the kind of scalability and velocity that makes the product better for everyone.
Just as our athletes are obsessed with tracking and analyzing their athletic pursuits, we here at Strava are obsessed with tracking our own performance; but instead of wattage and pace, we’re concerned with activity and engagement. In this post, I’ll talk a little bit about the infrastructure that powers our internal analytics and reporting.
The system requirements for analytics are often quite different from those of powering an application, especially at the complexity and scale of Strava. Initially, most analytics were run off of slave instances of the same SQL databases used in our backend. Eventually, our data outgrew what could be handled on a single server, forcing us to partition it across multiple servers. As a result, any query which required a join between data on different servers could no longer be expressed purely in SQL. Not a problem for engineers, but a huge barrier for business analysts and other data-savvy, but non-technical staff.
Enter Redshift, Amazon’s data warehouse solution. Every night, application data from the previous day is replicated to our Redshift cluster. Additionally, most features on Strava are instrumented via our logging infrastructure. Whenever an athlete views an activity or a feed on Strava, either from the mobile app or the web, event data is fired off to Kafka, aggregated, and periodically saved to S3. This data is then loaded to Redshift, where it can be queried alongside the rest of our relational data.
Data in Redshift is not indexed in the same way as traditional OLTP databases. Instead, each table is defined with a distribution key, and a sort key. The distribution key dictates which data lives on which node, while the sort key defines the order in which that data is stored. Defining the appropriate distribution keys is essential for SQL JOIN performance. In our Redshift cluster, wherever possible, data is distributed using an athlete’s unique ID. All data relevant to any given athlete lives on the same node, making the majority of our analytical queries quite fast.
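The effect of distributing by athlete ID can be mimicked in a few lines of Ruby. This is a toy model, not Redshift: the node count, the CRC32 hash, and the row shapes are all assumptions chosen to show why rows that share a distribution key can be joined without moving data between nodes.

```ruby
require "zlib"

NODE_COUNT = 4 # hypothetical cluster size

# Stand-in for the hash that assigns a distribution key to a node.
def node_for(athlete_id)
  Zlib.crc32(athlete_id.to_s) % NODE_COUNT
end

# Group rows by the node that would store them.
def distribute(rows)
  rows.group_by { |row| node_for(row[:athlete_id]) }
end

# Join two tables node by node. Because both are distributed on athlete_id,
# matching rows always live on the same node, so no cross-node lookup is needed.
def local_join(activities, views)
  views_by_node = distribute(views)
  distribute(activities).flat_map do |node, local_activities|
    local_views = views_by_node.fetch(node, [])
    local_activities.flat_map do |activity|
      local_views
        .select { |view| view[:athlete_id] == activity[:athlete_id] }
        .map { |view| activity.merge(view) }
    end
  end
end
```

If the two tables were distributed on different keys, the inner loop would have to search every node's rows, which is the data movement that a shared distribution key avoids.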
Having all our data available in one place greatly simplifies the task of asking questions about athlete behavior. As an example, here’s a tidbit I pulled recently showing Strava usage over 2014 by day of the week. The orange line tracks the count of uploading active members by day, while the blue line tracks non-uploading active members (e.g., someone who has not uploaded an activity, but still logs into Strava to view/comment/kudo an activity).
As you can see, a fair number of people are browsing Strava, even when they aren’t being active. This is especially true of Mondays, and to a lesser extent, Fridays. This makes sense — these are the two days where anyone with a desk job spends his or her afternoon thinking wistfully of fresh air and open roads. At Strava, we’re doing everything we can to make those daydreams and memories more vivid.
Strava has done a lot of work figuring out how athletes move; we have global heatmaps, challenge heatmaps, personal heatmaps and even Strava Metro. But what about where people go to stop? I looked at 4.3 million rides in the San Francisco Bay Area to answer this question and found the best places to meet friends, have coffee or just check out the view.
The results, browsable at labs.strava.com/top-stops, show some interesting patterns. For example, our cyclists tend to favor Peet’s over Starbucks in Los Altos, Cupertino and Orinda, and no one stops at Philz on the weekends. For meeting places, the most popular are the Golden Gate Bridge and near the Woodside Bakery. The data also made us ask questions, like who is the Tuesday/Thursday morning Headlands Raid ride waiting for every week on Hawk Hill?
One of the goals of this project was to automate everything to see if this could be done on a larger scale. So I simply ran the rides through the following steps:
- Find all the stops over 5 minutes (hopefully that filters out stop lights)
- Cluster the 2,771,301 stop locations. Instead of building or wiring in a hierarchical clustering algorithm, I aggregated them using the heatmap code which buckets points by pixel.
- Run the 4,607 pixels with more than 50 stops through the Foursquare venues explore API. I did this several times to make sure all the categories were covered, like coffee shops, markets and parks.
- Map all the stops to their closest venue and find the 150 most popular.
- Remap the stops to these popular venues. This helped clean up some data issues with Foursquare but also might have hidden some less popular venues.
- Finally, I aggregated all the numbers and created the UI.
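The first two steps above can be sketched in Ruby as follows. The point format, the drift tolerance, and the grid size standing in for a heatmap pixel are assumptions, and the venue lookup is left out because it depends on the Foursquare API.

```ruby
MIN_STOP_SECONDS = 5 * 60   # "stops over 5 minutes"
MIN_STOPS_PER_CELL = 50     # "pixels with more than 50 stops"
CELL_DEGREES = 0.0005       # hypothetical grid size standing in for a pixel
MAX_DRIFT_DEGREES = 0.0002  # hypothetical GPS wander allowed while stopped

def near?(a, b)
  (a[:lat] - b[:lat]).abs <= MAX_DRIFT_DEGREES &&
    (a[:lng] - b[:lng]).abs <= MAX_DRIFT_DEGREES
end

def long_enough?(anchor, last)
  !anchor.nil? && last[:time] - anchor[:time] > MIN_STOP_SECONDS
end

# points: array of { time: seconds, lat:, lng: } ordered by time.
# Returns the first point of every stay longer than MIN_STOP_SECONDS.
def find_stops(points)
  stops = []
  anchor = nil
  last = nil
  points.each do |point|
    if anchor && near?(point, anchor)
      last = point
    else
      stops << anchor if long_enough?(anchor, last)
      anchor = point
      last = point
    end
  end
  stops << anchor if long_enough?(anchor, last)
  stops
end

# Bucket stops into grid cells and keep the busy ones, as { cell => count }.
def popular_cells(stops, min_stops: MIN_STOPS_PER_CELL)
  stops
    .group_by { |s| [(s[:lat] / CELL_DEGREES).floor, (s[:lng] / CELL_DEGREES).floor] }
    .select { |_, members| members.size > min_stops }
    .transform_values(&:size)
end
```

Bucketing by grid cell is much cheaper than hierarchical clustering, at the cost of occasionally splitting one real location across two neighboring cells.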
I have to admit, I did do some manual work to clean up a few locations where there were two popular venues near each other. For example, I combined Alpine Lake into Alpine Dam and Mount Diablo State Park into Mount Diablo North Summit. But I did find these locations with a script.
The resulting 150 locations seem to make sense to me. However, since the process is mostly automated, I would expect that the map is missing some popular spots that are not included in Foursquare’s dataset. If you find some, I’d like to know so that the process can be improved. As for number 2, it’s the top of Old La Honda, where people stop to rest. If you really can’t believe it, you should: there’s even a cyclist hanging out in the Google Street View!
Starting Tuesday, July 29 at approximately 15:10 PDT, Strava experienced an outage that disrupted all uploads for almost 14 hours. For a period of about 35 minutes, all uploads to Strava failed before we determined the root cause and put the site into maintenance mode to determine the best course of action. We brought the site back into production at 16:10 PDT, but with all background job processing disabled. This allowed most interactive parts of the site to work. Uploads to Strava were captured, but were not processed until we were able to fix the underlying problem. The problem was fixed early Wednesday morning and all uploads were processed by 7:00 PDT.
When we process an upload, we identify and store multiple streams of data. Streams are time-indexed sequences that contain data for location, distance, speed, and so on – the core part of any activity recording. While the stream data itself is stored on Amazon’s Simple Storage Service (S3), we do store some stream metadata – in particular, which streams are associated with a given activity – in a table in a database. Each row has an auto-incrementing identifier which was stored as a four-byte signed integer. Although the range of signed integers goes from -2147483648 to 2147483647, only the positive portion of that range is available for auto-incrementing keys. At 15:10, the upper limit was hit and insertions into the table started failing.
We discussed a variety of approaches for tackling the problem, ranging from keeping the site down entirely to emergency code changes to messing around with database internals. In the end, we felt that allowing uploads but temporarily turning off upload processing was the approach that carried the least amount of risk while still allowing the site to remain generally operational as we fixed the problem. Around the same time that we brought Strava back online, we started running a database migration to change the type of the identifier from a signed integer to an unsigned integer. Unfortunately, we were not able to complete the migration in place so we brought up a fresh database slave, performed the migration there, and then proceeded to fail over to the new slave. The amount of data in the table was such that the migration ended up taking about 9 hours.
What We’re Doing
Normally we would have weeks of advance notice for this type of issue – we have in the past run many migrations to adjust data types or make other schema changes – but we did not have keyspace monitoring in place for this particular database. We have already conducted an audit of all primary keys and will be preemptively running schema changes to change all remaining signed integer primary keys to unsigned integers. Additionally, we have already put in place additional monitoring and will be reviewing keyspaces as part of our ongoing production processes.
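A keyspace check of the kind described here can be very small. The Ruby sketch below is an assumption about what such monitoring might compute, not a description of the monitoring we deployed; the alert thresholds and method names are hypothetical.

```ruby
SIGNED_INT32_MAX = 2**31 - 1   # 2,147,483,647: largest four-byte signed value
UNSIGNED_INT32_MAX = 2**32 - 1 # 4,294,967,295: largest four-byte unsigned value

# current_max_id: highest auto-increment id in use.
# ids_per_day: recent daily growth of the table.
def keyspace_report(current_max_id, ids_per_day, limit: SIGNED_INT32_MAX)
  remaining = limit - current_max_id
  {
    fraction_used: current_max_id.to_f / limit,
    days_remaining: ids_per_day.positive? ? remaining / ids_per_day : nil
  }
end

# Hypothetical thresholds: warn at 80% used or fewer than 90 days of headroom.
def keyspace_alert?(report, warn_fraction: 0.8, warn_days: 90)
  report[:fraction_used] >= warn_fraction ||
    (!report[:days_remaining].nil? && report[:days_remaining] <= warn_days)
end

report = keyspace_report(2_000_000_000, 5_000_000)
puts "alert" if keyspace_alert?(report)
```

Tracking days of headroom as well as the fraction used matters because a fast-growing table can exhaust its keys long before a percentage threshold looks alarming. Moving from signed to unsigned roughly doubles the available keys, so the same check still applies afterwards with `limit: UNSIGNED_INT32_MAX`.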
We will also be reviewing our ability to monitor and effectively control failing uploads. At the current time, we only have an overly broad way of disabling uploads. We need to be able to instantaneously disable either uploads or upload processing and to provide the appropriate messaging in our applications.
We recognize that we did not communicate with you, our users, nearly as effectively as we could have. We know that transparency and prompt updates are important, and as such, we are starting to work on ways to improve our communications. This includes streamlining our internal communications and making it easy for notifications and updates to be posted to various locations including strava.com, our mobile applications, and our Twitter account. We are also putting together a site that will include up-to-the-minute information on the availability of Strava.
We are athletes and users of Strava ourselves. It is an important part of our personal lives as well as our professional ones so any problem with Strava frustrates us as much as it does you. We are continually working to ensure the highest level of availability and reliability. However, there are times when things either fall through the cracks or impact us in ways that we had not anticipated. Each of these incidents gives us the chance to critically re-examine ourselves; to improve our infrastructure, practices, and processes; and to come back stronger than we were before.
In March, we released the Strava 4.0 Android and iPhone Apps, which featured a completely redesigned Activity Feed. In order to bring the new design from a concept to a functional product, both platform teams went through several iterations of implementation and performance tuning. This post highlights some of the techniques we used along the way, obstacles we faced and limitations we discovered.
One of the most striking differences in Strava 4.0 is the increase in maps used throughout the feed (prior to 4.0, we displayed thumbnail maps and only showed them in the “Me” Feed). While both iOS and Android platforms provide robust mapping APIs (complete with interactive view objects), we knew from prior experience that scrolling performance would be unacceptable if we simply added MKMapView (iOS) or MapView (Android) instances to Feed Entries. Those classes are designed to be high definition, interactive map views, rather than lightweight, static views that scroll nicely in a long list with many items. So, instead of using native maps, we requested map images from a remote provider and displayed them as bitmaps in the Feed Entries. The initial implementation was acceptable, but requesting larger map images in greater volume over a mobile network introduced a substantial number of “placeholder” images while scrolling. In order to add interesting information to the Activity Feed Entries before their corresponding maps were loaded, we considered what additional data we could display without any additional loading. As it turns out, every Activity (Run, Ride, etc.) displayed in the Feed has a “summary polyline”, so as we prepare a Feed Entry for display we:
- Immediately render the Activity summary polyline on a background thread and display it over a simple background. The polyline view is generated using the Mercator Projection.
- Request the map image from a remote provider on a background thread and fade it in when it is ready.
With this implementation, a quick scroll through an Activity Feed will always show a summary polyline and give Activities “shape”, even if location images have not yet been retrieved.
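The projection step behind the polyline view can be sketched in Ruby. The sketch assumes the summary polyline has already been decoded into latitude/longitude pairs, and the function names and scaling approach are illustrative rather than taken from either app.

```ruby
# Mercator projection: x is linear in longitude, y = ln(tan(pi/4 + lat/2)).
def mercator(lat_deg, lng_deg)
  lat = lat_deg * Math::PI / 180.0
  x = lng_deg * Math::PI / 180.0
  y = Math.log(Math.tan(Math::PI / 4.0 + lat / 2.0))
  [x, y]
end

# Project [lat, lng] pairs and scale them into a width x height view,
# preserving the aspect ratio. Pixel y grows downward, so north ends up on top.
def project_polyline(latlngs, width, height)
  return [] if latlngs.empty?

  projected = latlngs.map { |lat, lng| mercator(lat, lng) }
  xs, ys = projected.transpose
  span_x = [xs.max - xs.min, Float::EPSILON].max # avoid dividing by zero
  span_y = [ys.max - ys.min, Float::EPSILON].max
  scale = [width / span_x, height / span_y].min
  projected.map { |x, y| [(x - xs.min) * scale, (ys.max - y) * scale] }
end

route = [[37.0, -122.0], [37.05, -121.95], [37.1, -121.9]]
project_polyline(route, 200, 100).each { |x, y| puts format("%.1f, %.1f", x, y) }
```

Using the same projection as the map tiles means the drawn polyline lines up with the map image once it fades in.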
iOS Performance Tuning
Refraining from prematurely optimizing, the iOS team first developed the core implementation of the new Feed before starting to look at what could be done to make it perform better. The new Feed cells are rather complex and so is the Feed.
However, one performance improvement that was agreed upon from the get-go with design was that the cells would not change height based on the text in them. While dynamically sizing cells based on the text content is not necessarily difficult to do on iOS, it can be rather costly in complicated cases. The main issue is that before the cells even get instantiated you have to compute how tall they will be, essentially performing the entire layout (and text rendering) twice. So, instead we have a few different types of cells which each have their own fixed size and the text resizes inside the boundaries of the cells with the gray header taking more or less space on top of the map.
With that in mind, we implemented our cells and then started analyzing the performance. On the most recent devices, everything was buttery smooth, but on older devices, not so much. Transparency is the main enemy of smooth scrolling in Table Views on iOS. The best tool to figure out transparency issues is the Color Blended Layers option in the Simulator’s Debug menu:
With this option turned on the screenshot of the iPhone Feed above turns into this:
The red areas highlight which parts of the screen are being blended. The darker the red, the more blending. In order to minimize the blending we made sure that all the labels which could be opaque were opaque and had a proper background color. In addition to the background color, we decided, with the design team’s blessing, to take a more aggressive approach on older devices and slightly altered the appearance of the Feed, removing as much transparency as possible and removing all cross-fade animations which were happening when a map finished loading. And this is what the Feed looks like on iPhone 3GS and iPhone 4:
As you can see in the right screenshot, there is significantly less red, meaning significantly less layer blending. These tweaks led to much improved scrolling performance on iPhone 3GS and iPhone 4.
Android Performance Tuning
The Android implementation also began by first fleshing out the basic views for the various Feed Entry types with our existing set of utilities, deferring any performance optimizations until the functionality was complete. We then turned our focus to improving performance by evaluating our image loading library. With the new feed, a single Entry may have up to 3 associated images (profile photo, map background and Instagram thumbnail). In prior releases, the library we used for requesting and displaying remote images handled requests serially, but with the new Feed a noticeable backlog became evident when scrolling through multiple Feed Entries. To reduce this backlog, we had to find or build a new image loading library that would allow us to retrieve remote images in parallel, cancel pending load requests and manage image caching. After a bit of research and experimentation with several of the popular libraries available, we found Google’s Volley to be the fastest and most customizable tool for our needs. Requests are easy to create, cancel and most importantly, modify. We wrapped our Volley usage in a custom class that adds LRU memory caching, disk caching, custom callbacks, animations and device-specific parameter tuning. This allows us to limit the amount of resources used for loading images on older devices.
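The memory-cache part of that wrapper follows the usual least-recently-used pattern. Here it is sketched in Ruby, to match the other code in these posts, rather than in the Java our Android app uses. The class below is illustrative and is not Volley's API or our wrapper.

```ruby
# Minimal LRU cache: when full, the entry that was used longest ago is evicted.
class LruCache
  def initialize(max_entries)
    raise ArgumentError, "max_entries must be positive" unless max_entries.positive?

    @max_entries = max_entries
    @entries = {} # Ruby hashes keep insertion order, oldest key first
  end

  # Returns the cached value, or nil on a miss.
  def get(key)
    return nil unless @entries.key?(key)

    value = @entries.delete(key)
    @entries[key] = value # re-insert to mark as most recently used
  end

  def put(key, value)
    @entries.delete(key)
    @entries[key] = value
    @entries.shift while @entries.size > @max_entries # drop the oldest entries
    value
  end

  def size
    @entries.size
  end
end

# A smaller limit could be chosen on older devices (hypothetical values).
cache = LruCache.new(2)
cache.put("profile.jpg", :bitmap_a)
cache.put("map.jpg", :bitmap_b)
cache.get("profile.jpg")
cache.put("thumb.jpg", :bitmap_c) # evicts "map.jpg"
```

An image cache would normally bound total bytes rather than entry count, but the eviction order works the same way.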
Once we had our new image loading library configured and working properly, we profiled the network traffic while scrolling through the feed and found that on high-resolution devices, the static map background images we requested (in PNG format) approached 300KB in size in urban areas. At that rate, scrolling through 10 new Feed Entries would consume 3MB of network data in map images alone. To minimize traffic, we switched the images we requested to JPEG at 70% quality, which reduced the image footprint by roughly half with an imperceptible reduction in the visual quality of the overall Feed Entry.
With things running relatively smoothly, we still noticed a seemingly random “stuttering” as we scrolled through the feed. A quick look at the GPU rendering profile (which can be enabled on an Android device under Settings->Developer Options->Profile GPU Rendering) confirmed that we were occasionally dropping frames as we scrolled.
To diagnose this, we ran our app through the Android Monitor tools and found that Garbage Collection (GC) was the main offender. In order to scroll smoothly at 60 frames per second (fps), a single frame must take no more than about 16 milliseconds (ms) to render. As the log output showed, GC occurred frequently while scrolling the Feed and took up to 115ms (that’s about 7 dropped frames).
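The frame arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function names are made up for illustration, and it assumes that every frame deadline that passes during a GC pause counts as one dropped frame.

```python
import math


def frame_budget_ms(fps: float) -> float:
    """Time available to render a single frame at the given frame rate."""
    return 1000.0 / fps


def frames_dropped(pause_ms: float, fps: float = 60.0) -> int:
    """Approximate number of frame deadlines missed during a pause.

    Assumption: the pause blocks rendering completely, so every frame
    whose deadline falls inside the pause is counted as dropped.
    """
    return math.ceil(pause_ms / frame_budget_ms(fps))


if __name__ == "__main__":
    print(round(frame_budget_ms(60), 2))  # budget per frame at 60 fps
    print(frames_dropped(115))            # frames lost to a 115 ms GC pause
```

At 60 fps the budget is 1000 / 60 ms, a little under 17 ms, so a 115ms pause spans roughly seven frames.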
In order to minimize our memory consumption, we made several adjustments:
- Hunted down any old code that was excessive in its Object creation (which is good practice anyway).
- Changed the Bitmap decoding configuration to use RGB_565, which consumes half the memory of the default ARGB_8888. Given our Feed images are not fullscreen and predominantly opaque, this change in decoding had a negligible impact on the user experience.
These changes helped, but after inspecting snapshots of heap allocation we discovered that the main trigger for Garbage Collection in the new feed was the frequent allocation of memory for all the remote image Bitmaps. The most effective thing we can do to improve Feed scrolling performance is to add a reusable Bitmap pool and decode new Bitmaps into the same memory space as stale ones. To date, we have not found a library that does this for us, so we plan on building this ourselves.
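To make the pooling idea concrete, here is a minimal sketch in Python rather than Android code. It is not the pool we plan to build: `BufferPool`, `bitmap_bytes` and the idea of keying free buffers by exact byte size are all assumptions made for illustration. On Android, the platform mechanism for decoding into existing memory is, as far as we know, the `inBitmap` field of `BitmapFactory.Options`.

```python
from collections import defaultdict

# Bytes per pixel for the two decoding configurations discussed above.
BYTES_PER_PIXEL = {"RGB_565": 2, "ARGB_8888": 4}


def bitmap_bytes(width: int, height: int, config: str) -> int:
    """Memory needed for the pixel data of one decoded image."""
    return width * height * BYTES_PER_PIXEL[config]


class BufferPool:
    """Hypothetical pool that hands back stale buffers instead of allocating."""

    def __init__(self):
        self._free = defaultdict(list)  # byte size -> list of idle buffers
        self.allocations = 0            # how many fresh buffers were created

    def acquire(self, size: int) -> bytearray:
        free = self._free[size]
        if free:
            return free.pop()           # reuse: no new allocation, no garbage
        self.allocations += 1
        return bytearray(size)

    def release(self, buf: bytearray) -> None:
        """Return a buffer whose image has scrolled off screen."""
        self._free[len(buf)].append(buf)


if __name__ == "__main__":
    pool = BufferPool()
    size = bitmap_bytes(640, 360, "RGB_565")
    first = pool.acquire(size)
    pool.release(first)
    second = pool.acquire(size)         # same memory, reused
    print(second is first, pool.allocations)
```

Because Feed images of the same type tend to share dimensions, most requests in a pool like this would be served from the free list, which is what keeps the allocator (and therefore the garbage collector) quiet while scrolling.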
Overall, the new Feed gave us a great opportunity to focus on App performance and we continue to research and leverage new techniques with each release. Please leave any questions, feedback or recommendations in the comments section below.
Update (Nov. 3, 2014): The map now contains 160 million activities and 375,000,000,000 points.
Recently we released a global heatmap of 77,688,848 rides and 19,660,163 runs from the Strava dataset. This was more of an engineering challenge to create a visualization of that size than anything else. But still, the map has raised many questions about how and where people run and ride. Some of these can only be answered using the raw data, which is addressed by our Metro product.
The code to generate the map is the grandchild of a heatmap I built almost two years ago. Last year the code was cleaned up and became the Personal Heatmaps feature on Strava. This time it has been refactored to handle the large dataset by reading from mmapped files stored on disk.
To start out, the world is broken up into regions represented by zoom level 8 tiles. Each one of these regions has a file containing the sorted set of key/values where the key is the pixel zoom and quadkey and the value is the count of GPS points. The quadkeys make it so all the data for a tile is stored sequentially in the file. Pixels with no GPS points are excluded and only every 3rd zoom level is stored in the file. The values for the missing zooms can be found by adding up the 4 or 16 corresponding values from the next stored, deeper zoom level. Skipping zoom levels saves a bit of disk space, but it also preloads into memory the region of the file needed for deeper zooms.
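As a rough illustration of why quadkeys work well here, the sketch below computes a standard quadkey from tile (or pixel) coordinates and rebuilds a count for a skipped zoom level from the stored, deeper level. The function names and the use of a plain dictionary in place of the on-disk file are assumptions made for the example, not the actual server code.

```python
from itertools import product


def quadkey(x: int, y: int, zoom: int) -> str:
    """Standard quadkey: one base-4 digit per zoom level, most significant first."""
    digits = []
    for i in range(zoom, 0, -1):
        mask = 1 << (i - 1)
        digit = (1 if x & mask else 0) + (2 if y & mask else 0)
        digits.append(str(digit))
    return "".join(digits)


def descendant_keys(key: str, levels: int) -> list:
    """All quadkeys `levels` zooms deeper that fall inside `key`.

    One level deeper gives 4 keys, two levels deeper gives 16.
    """
    return [key + "".join(suffix) for suffix in product("0123", repeat=levels)]


def count_from_deeper(key: str, levels: int, stored: dict) -> int:
    """Rebuild the count for a skipped zoom by summing the stored deeper values.

    `stored` maps quadkey -> GPS point count; keys with no points are absent.
    """
    return sum(stored.get(k, 0) for k in descendant_keys(key, levels))


if __name__ == "__main__":
    print(quadkey(3, 5, 3))
    stored = {"210": 3, "213": 2, "220": 9}
    print(count_from_deeper("21", 1, stored))
```

Every descendant of a quadkey shares that quadkey as a prefix, which is why sorting by quadkey keeps all of a tile's data contiguous in the file.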
This results in about 9000 files (6300 for rides, 4700 for runs) that are all opened as memory mapped files when the server starts. When a request for a tile comes in, the server finds the corresponding file handle and does a binary search on the keys. Since the info for the tile is stored sequentially in this file, it can do a fast read and build a 2D array of the number of GPS points in each pixel of the tile. Now those values need to be normalized to a value between 0 and 1 and colorized.
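The lookup described above can be sketched with Python's `mmap` and `struct` modules. The record layout (an 8-byte integer key followed by a 4-byte count), the file name and every function name here are assumptions for the sake of a runnable example; the real file format is not described in this post.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-width record: big-endian 8-byte key, 4-byte count.
RECORD = struct.Struct(">QI")


def write_records(path, pairs):
    """Write (key, count) pairs sorted by key, as the server expects."""
    with open(path, "wb") as f:
        for key, count in sorted(pairs):
            f.write(RECORD.pack(key, count))


def open_mapped(path):
    """Memory map the file read-only; the OS page cache does the caching."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def lower_bound(buf, key):
    """Binary search for the index of the first record with key >= `key`."""
    lo, hi = 0, len(buf) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        mid_key, _ = RECORD.unpack_from(buf, mid * RECORD.size)
        if mid_key < key:
            lo = mid + 1
        else:
            hi = mid
    return lo


def read_range(buf, first_key, last_key):
    """Binary search to the start of the range, then read sequentially."""
    total = len(buf) // RECORD.size
    i = lower_bound(buf, first_key)
    out = []
    while i < total:
        key, count = RECORD.unpack_from(buf, i * RECORD.size)
        if key > last_key:
            break
        out.append((key, count))
        i += 1
    return out


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "region.bin")
        write_records(path, [(40, 1), (10, 5), (20, 7), (30, 2)])
        buf = open_mapped(path)
        try:
            return read_range(buf, 15, 30)
        finally:
            buf.close()


if __name__ == "__main__":
    print(demo())
```

Only the pages touched by the binary search and the sequential read need to be resident, which is what makes opening thousands of mapped files at startup practical.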
The normalization is very local, taking into account the 8 neighboring tiles. For each tile, the 95th percentile of the non-zero GPS counts is computed. These values are averaged into 4 corner “heatmax” values for the current tile. The count for every pixel is divided by the bilinear interpolation of those values, capped at 1. This [0,1] value is used to color each pixel using a gradient function. You can see the local effect on roads that branch away from popular routes.
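Here is one way the normalization could look in code. It is a sketch under stated assumptions: the percentile uses a simple nearest-rank rule, each corner “heatmax” is taken to be the average over the four tiles that meet at that corner, and the function names are invented for the example.

```python
def percentile_nonzero(counts, pct=95):
    """Nearest-rank percentile over the non-zero counts (0 if there are none)."""
    values = sorted(c for c in counts if c > 0)
    if not values:
        return 0
    rank = (pct * len(values) + 99) // 100  # integer ceil(pct * n / 100)
    return values[rank - 1]


def corner_heatmax(p):
    """Corner values for the center tile of a 3x3 grid of tile percentiles.

    Assumption: each corner is the mean of the four tiles that share it.
    Returns (top_left, top_right, bottom_left, bottom_right).
    """
    def avg(r, c):
        return (p[r][c] + p[r][c + 1] + p[r + 1][c] + p[r + 1][c + 1]) / 4

    return avg(0, 0), avg(0, 1), avg(1, 0), avg(1, 1)


def bilinear(corners, u, v):
    """Interpolate the corner values at fractional position (u, v) in [0, 1]."""
    tl, tr, bl, br = corners
    top = tl + (tr - tl) * u
    bottom = bl + (br - bl) * u
    return top + (bottom - top) * v


def normalize(count, corners, u, v):
    """Pixel count divided by the local heatmax, capped at 1."""
    heatmax = bilinear(corners, u, v)
    if heatmax <= 0:
        return 0.0
    return min(count / heatmax, 1.0)


if __name__ == "__main__":
    grid = [[4, 4, 4], [4, 4, 4], [4, 4, 12]]  # busier tile to the lower right
    corners = corner_heatmax(grid)
    print(corners)
    print(normalize(3, corners, 0.0, 0.0), normalize(3, corners, 1.0, 1.0))
```

Interpolating between corners avoids visible seams at tile edges, since two adjacent tiles compute the same heatmax along their shared border.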
This is all done on the fly (minus memcache and a CDN) and takes about 200 milliseconds per tile. Why serve on the fly? Well, the ride map has 106,991,000 unique tiles, times three colors, plus the run and both versions and you’ve got a lot of S3 objects. This saves that step and lets me update parts of the map and tweak or add colors as needed.
Everything is hosted on a single c1.xlarge EC2 instance which maxes out at about 150 tiles a second. Because the files are memory mapped, the OS does all the caching. Still, the process is IO-bound as users are accessing the map all over the world. SSD-backed instances would solve the IO issues, but they’re way more expensive. Given that the CDN serves most of the tiles, I figured small slowdowns from IO bottlenecks are okay. This only happens when we have huge traffic, like when being featured on a Belgian national news site.
There have been a lot of suggestions on different types of maps to create, but I’m not really sure what’s next for this map, or the code. I think incorporating direction of travel could look really cool, but right now I’m more interested in using the map data. If you think about it, the heatmap is just a density distribution of GPS points. A “noisy” GPS stream could be corrected using these probabilities. The Slide Tool represents some of my initial thinking in this direction.
Over the last several months we’ve been busy developing an API that works for all parties involved: strong privacy for our users and consistent actions and endpoints for our developers.
First and foremost, we want all of our users to have access to their data. A few months ago we built the bulk activity export tool, but the API now provides more fine-grained and complete access. To share data with 3rd parties we now have OAuth which allows users to grant (and revoke) access at any time, without sharing passwords.
We also have new and complete documentation to help developers get started quickly, and we’ve worked to make the API consistent and predictable. Updated authentication allows for simple “Log in with Strava” or “Connect with Strava” actions.
This is just the beginning. Over time we’ll continue to add endpoints and build libraries to make the Strava API easy to use on all platforms. If you have any questions, please post them to our developer forum or contact developers -at- strava.com.
The main activity feed on the Strava dashboard does a great job of keeping athletes up to date on the riding and running activities of other athletes that they’re following, but we thought it might be interesting for athletes to be able to engage more with the social activities of others on Strava: what activities are earning a ton of kudos, or sparking conversation among others being followed.
Enter the Minifeed, an experimental feature we recently introduced. The Minifeed is a widget on the Strava dashboard that displays comments, kudos, and activities involving athletes that you follow, in real time. If you haven’t already given it a try, you can enable it on the X-Feature page.
This blog post will describe some of the technical challenges we faced, decisions we made, and technologies we used to develop this feature on our server infrastructure.
When relevant events (comments, kudos, activities) occur on Strava, we need a system to identify these events, process relevant information, identify interested parties, and then finally deliver data to clients. We’ll cover these steps in roughly the same order that the event data follows.
Kafka is a distributed, durable publish-subscribe messaging system designed for very high throughput. We use Kafka throughout our infrastructure for various event and message processing tasks, including here for publication of application activity. When an athlete gives kudos, writes a comment, or creates an activity, an application server generates a corresponding event and publishes it to the relevant Kafka ‘topic’ (kudos, comment, activity).
Storm is a distributed real-time computation system that allows for flexible, scalable processing of streams of data. Storm will handle many cluster management tasks automatically, including restarting and relocating worker processes when failures occur. As a developer, you specify the types and quantities of workers you want, their behavior, and their relationship to each other; Storm handles most of the rest. This makes it very easy for us to develop and deploy systems to process events from Kafka. There are three main concepts in Storm:
- Spouts are simply sources of data streams that emit tuples.
- Bolts accept tuples from streams as input, and emit other tuples as output.
- Topologies specify networks of spouts and bolts. A topology specifies the spouts in the network, and defines the connections between those spouts and the bolts, as well as between bolts. For each bolt, it indicates which other bolts and spouts it should receive input from, the parallelism of the bolt (how many running instances of the bolt to create) and how that input is distributed, or grouped, across instances of the bolt.
There are several different types of grouping, including:
- shuffle grouping: tuples are distributed to bolts randomly
- all grouping: tuples are sent to *all* bolts
- fields grouping: tuples are sent to individual bolts based on a subset of fields in the tuple such that equal values for that set of fields always go to the same bolt
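The difference between the groupings can be illustrated with a small routing sketch. This is not Storm's API or its actual hashing scheme: the function names, the dictionary-shaped tuples and the use of CRC32 are assumptions chosen to keep the example short and deterministic.

```python
import random
import zlib


def shuffle_grouping(tup, num_tasks, rng=random):
    """Send the tuple to one bolt instance chosen at random."""
    return [rng.randrange(num_tasks)]


def all_grouping(tup, num_tasks):
    """Send the tuple to every bolt instance."""
    return list(range(num_tasks))


def fields_grouping(tup, num_tasks, fields):
    """Send the tuple to the instance selected by hashing the chosen fields.

    Equal values for `fields` always hash to the same instance index.
    """
    key = "|".join(str(tup[f]) for f in fields)
    return [zlib.crc32(key.encode("utf-8")) % num_tasks]


if __name__ == "__main__":
    kudos = {"athlete_id": 7, "type": "kudos"}
    comment = {"athlete_id": 7, "type": "comment"}
    # Both events for athlete 7 land on the same instance.
    print(fields_grouping(kudos, 4, ["athlete_id"]))
    print(fields_grouping(comment, 4, ["athlete_id"]))
    print(all_grouping(kudos, 4))
```

Fields grouping is the one to reach for when a bolt keeps per-key state or a per-key cache, since all tuples for a given key arrive at the same instance.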
We use a Storm topology to process events: determining which athletes should see each event, populating them with more detail, and finally emitting them to be sent to clients.
First, we have three spouts for handling events from the Strava application servers: a kudos spout, comment spout, and activity spout, all of which are instances of a Kafka spout that emits tuples consumed from a designated Kafka topic.
All three of these spouts then feed into a parse bolt (selected at random with shuffle grouping), which then determines which athletes should see each event in their Minifeed. We determine this based on the follower lists of the athletes involved, excluding those who are blocked by either. Of course, at any given time not all of these followers will be active on the Strava website, so we don’t want to bother doing anything to show events to those who won’t see them anyway. To handle this, we maintain a set of currently active site users (described in more detail below), and determine the intersection of the two sets: all active site users who should be shown the event in question. The parse bolt emits a tuple for each of these users with the user and event information.
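In set terms, the parse bolt's decision boils down to a union, a difference and an intersection. The sketch below shows that logic only; the function names, argument names and tuple shape are hypothetical, and the real bolt reads these sets from our data stores rather than taking them as arguments.

```python
def recipients(actor_followers, target_followers, blocked, active):
    """Athletes who should receive an event in their Minifeed.

    actor_followers / target_followers: followers of the athletes involved.
    blocked: followers blocked by either athlete.
    active: athletes currently considered active on the site.
    """
    interested = (set(actor_followers) | set(target_followers)) - set(blocked)
    return interested & set(active)


def emit_tuples(event, athlete_ids):
    """One output tuple per recipient, in a stable order for readability."""
    return [(athlete_id, event) for athlete_id in sorted(athlete_ids)]


if __name__ == "__main__":
    to_notify = recipients(
        actor_followers=[1, 2, 3],
        target_followers=[3, 4, 5],
        blocked=[2],
        active=[1, 2, 4, 9],
    )
    print(emit_tuples({"type": "kudos", "activity_id": 42}, to_notify))
```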
Tuples from the parse bolt are then sent via shuffle grouping to a random event maker bolt, which reads the limited information provided with the event (e.g. IDs for athletes involved, activities, comments), and performs database reads to fetch details that we would like to display (e.g. athlete names, comment contents, activity titles). The bolt then emits the annotated result.
Finally, a sink bolt receives these annotated tuples and performs writes to a Redis instance, which is in turn read from when sending updates to clients.
We host a WebSocket server using the Play Framework. Web clients create WebSocket connections to this server when a Strava dashboard page with the Minifeed is loaded, and keep the connection open, listening for events as long as the page stays open in the browser. The server, in turn, reads from Redis and writes events to the WebSocket as they are read.
Before going into more detail on the data maintained by Redis, it’s helpful to understand more detail about how we determine “active” users. As mentioned above, we don’t want to process events for all Strava athletes all the time; it’s only necessary to handle events for those athletes who have the Minifeed enabled, and are active on the website. However, given the live nature of the Minifeed, if we only processed events for those athletes with an active WebSocket connection at any given time, the Minifeed would be empty when it is initially loaded. Instead, we consider any athlete that has connected at any point in the last week to be active, and maintain a list of recent feed items for each of these athletes in addition to handling events in a live fashion as they occur.
There are three different types of data handled by Redis in this scheme:
- Lists of recent feed events, one for each athlete, keyed by athlete ID. When an athlete initially loads the dashboard, before sending any live events, the list of recent feed events is read from Redis and sent to the client. The aforementioned sink bolt writes new events to this list, and keeps it trimmed to the most recent 25 events.
- The set of active subscribers. We maintain this in a global sorted set of athlete IDs with a “last active” timestamp as the score. This “last active” timestamp is updated by the WebSocket server once an hour as long as the connection is active, and the Storm topology removes entries more than a week old.
- Live events pub/sub: After sending the recent feed events to the client, the WebSocket server subscribes to a channel keyed by athlete ID. The Storm sink bolt publishes events to this channel, and the WebSocket server forwards them to the client as they are received. When the client disconnects, the server unsubscribes.
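The three roles described above can be mimicked with an in-memory stand-in, which makes the data flow easier to follow without a running Redis instance. Everything in this sketch is an assumption made for illustration: the `MinifeedStore` class and its method names are invented, and in Redis the same operations would correspond roughly to list push and trim commands, sorted-set updates and range removals by score, and publish/subscribe.

```python
import time
from collections import defaultdict, deque

RECENT_LIMIT = 25                 # most recent events kept per athlete
ACTIVE_WINDOW_S = 7 * 24 * 3600   # athletes count as active for one week


class MinifeedStore:
    """In-memory stand-in for the three kinds of data kept in Redis."""

    def __init__(self):
        # athlete ID -> newest-first list of recent events, auto-trimmed
        self.recent = defaultdict(lambda: deque(maxlen=RECENT_LIMIT))
        # athlete ID -> "last active" timestamp (the sorted-set score)
        self.last_active = {}
        # athlete ID -> callbacks for currently connected clients
        self.subscribers = defaultdict(list)

    def touch(self, athlete_id, now=None):
        """WebSocket server: refresh the athlete's last-active timestamp."""
        self.last_active[athlete_id] = time.time() if now is None else now

    def expire_inactive(self, now=None):
        """Storm topology: drop athletes not seen for more than a week."""
        now = time.time() if now is None else now
        cutoff = now - ACTIVE_WINDOW_S
        stale = [a for a, ts in self.last_active.items() if ts < cutoff]
        for athlete_id in stale:
            del self.last_active[athlete_id]

    def subscribe(self, athlete_id, callback):
        """On connect: register for live events and return the backlog."""
        self.subscribers[athlete_id].append(callback)
        return list(self.recent[athlete_id])

    def unsubscribe(self, athlete_id, callback):
        """On disconnect: stop forwarding live events to this client."""
        self.subscribers[athlete_id].remove(callback)

    def deliver(self, athlete_id, event):
        """Sink bolt: record the event, then publish it to live clients."""
        self.recent[athlete_id].appendleft(event)
        for callback in list(self.subscribers[athlete_id]):
            callback(event)


if __name__ == "__main__":
    store = MinifeedStore()
    for i in range(30):
        store.deliver(1, f"event-{i}")
    live = []
    backlog = store.subscribe(1, live.append)
    store.deliver(1, "event-live")
    print(len(backlog), backlog[0], live)
```

Keeping the backlog separate from the live channel is what lets a freshly loaded dashboard show recent items immediately while still receiving new events as they happen.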
This is a general end-to-end overview of how the Minifeed works behind the scenes. There are some additional optimizations and bits of polish, both planned and already completed, that are not detailed here. As it’s currently an opt-in experimental feature, load on the system is fairly light compared to the rest of Strava infrastructure. Over time, as the number of users with the Minifeed enabled increases, it may be necessary to add additional caching layers, reconfigure the Storm topology to maximize cache hit rates, or reconsider how active subscribers are determined and handled. That said, since launching, it has been trouble-free, in no small part thanks to the convenience of our existing Kafka-based event logging system and the Storm and Play frameworks.