RouteMaster is the engine that powers point-to-point path finding for Strava routes in our newly released Route Builder. The killer feature of RouteMaster is data drawn from real Strava athletes. Instead of merely knowing which paths are designated for cycling/running, we analyze tens of billions of GPS points from millions of Strava activities. This data gives us aggregate statistics about how athletes use every road and path section in the world.
This post is an in-depth engineering overview of routing and the design behind RouteMaster. Building a routing engine from the ground up was a fantastic experience, and I wanted to share some of the more interesting components and problems that I solved along the way. A word of warning: this post gets very technical and assumes a good deal of background with data structures and algorithms.
A simple approach to road routing is to minimize total distance along the path between the origin and destination. This can be formulated as the single pair shortest path problem over a graph with intersections as nodes, road sections as edges, and edge weights or costs as geographic distance. Routing engines for driving often use a more sophisticated notion of cost. The goal of a driver is almost always to minimize travel time. Knowledge about speed limits, current traffic, and even turning costs can be used to construct a cost function that is an estimate of the expected time of the journey.
RouteMaster’s dynamic cost function takes into account factors that aim to optimize routes according to the preferences of athletes. For example, when the route popularity toggle is turned on, edge popularity attributes are considered in the cost function alongside length. RouteMaster increases the cost of roads that are seldom traveled. This leads to routes that are more likely to use popular cycling routes, and that might be longer (but hopefully more enjoyable!) than the shortest route. The elevation toggle penalizes vertical elevation gain and favors routes with minimal climbing. We hope to add more routing controls in the future.
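As a rough illustration, a dynamic cost function along these lines might look like the sketch below (in Python for brevity; the weight values, attribute names, and exact formula are illustrative assumptions, not RouteMaster's actual parameters):

```python
def edge_cost(length_m, popularity, elevation_gain_m,
              use_popularity=False, avoid_hills=False,
              popularity_weight=2.0, climb_weight=10.0):
    """Illustrative dynamic edge cost.

    Base cost is geographic length in meters; optional terms penalize
    unpopular roads and elevation gain. `popularity` is assumed to be
    normalized to [0, 1], with 1 = heavily traveled.
    """
    cost = length_m
    if use_popularity:
        # Seldom-traveled edges (popularity near 0) get up to a
        # (1 + popularity_weight) multiplier on their length.
        cost *= 1.0 + popularity_weight * (1.0 - popularity)
    if avoid_hills:
        # Charge a fixed penalty per meter of vertical gain.
        cost += climb_weight * elevation_gain_m
    return cost
```

With both toggles off, the cost reduces to plain length and the search degenerates to ordinary shortest-distance routing.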
A graph store is required to encode the topological structure of road networks. Graph edges correspond to road sections, and graph nodes correspond to road intersections. For example, a four-way intersection would be represented as a node with four connecting edges. The graph store supports a query that returns the edges adjacent to any node.
We experimented with existing solutions such as Neo4j and GraphHopper before deciding to use a custom graph store implementation. GraphHopper is a great project, but the codebase was at the time too specialized to automobile routing. Neo4j is ultimately designed to support mutable graphs, and thus can’t hope to be as efficient as a custom immutable implementation. We don’t need to support live editing of route data– updates are batched and the database is regenerated every few weeks. Because our graph is immutable, implementing a custom data store wasn’t too difficult. We use an adjacency-list representation of the graph, modified so that everything fits in a single array indexed by edge_id. GraphHopper gives a good overview of a similar data structure. The graph store uses only 16 bytes per edge and 4 bytes per node.
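A minimal sketch of this kind of packed, immutable adjacency layout (a CSR-style structure, shown in Python; RouteMaster's actual field widths, encoding, and on-disk format differ):

```python
import array

class ImmutableGraph:
    """Adjacency lists packed into flat arrays.

    first_edge[n] .. first_edge[n+1] index into edge_target/edge_length
    for the edges leaving node n. Once built, nothing is ever mutated.
    """

    def __init__(self, num_nodes, edges):
        # edges: iterable of (source_node, target_node, length_m)
        self.first_edge = array.array('i', [0] * (num_nodes + 1))
        self.edge_target = array.array('i')
        self.edge_length = array.array('f')
        for src, dst, length in sorted(edges):
            self.edge_target.append(dst)
            self.edge_length.append(length)
            self.first_edge[src + 1] += 1
        # Prefix-sum the per-node counts into offsets.
        for n in range(num_nodes):
            self.first_edge[n + 1] += self.first_edge[n]

    def neighbors(self, node):
        lo, hi = self.first_edge[node], self.first_edge[node + 1]
        return [(self.edge_target[i], self.edge_length[i])
                for i in range(lo, hi)]
```

Because the arrays are index-based rather than pointer-based, the per-edge footprint stays small and the whole structure can live in a memory-mapped file.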
Both edges and nodes also have attributes. Node attributes include lat/lng coordinates and elevation. Edge attributes include information about elevation gain, gradient, length, popularity, and type of road. Constant-sized node and edge attributes are stored in a binary array flatfile. We also use a blob storage data structure that keeps variably sized edge attributes. These attributes include road names and compressed elevation and path geometry. These variable-sized attributes are not accessed in the graph traversal stage– they are only used to annotate the final route with metadata.
A geospatial index determines the geographically closest nodes to the user’s origin and destination points. The index supports a nearest neighbor query. Tree-based structures (kd-tree, r-tree) are most commonly used to solve this problem. However, tree-based solutions typically have large memory requirements, especially if the tree is built using 64-bit pointers. Our index solution uses the clever geohash algorithm and stresses storage efficiency over raw performance.
Geohash is a hashing algorithm for lat/lng coordinates. Geohashes can also be represented as an N-bit integer. An N-bit geohash represents a rectangle in lat/lng coordinate space. Each additional bit of precision divides the rectangle in two. For example, all points in the western hemisphere have ‘0’ as the first bit, and points in the eastern hemisphere begin with ‘1’. A prefix of a geohash (fewer bits of precision) represents a larger box that contains the smaller box corresponding to the longer geohash. If you want to learn more about geohash, you can read a longer introduction or play with geohashing.
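The integer form of the encoding can be sketched as alternating bisection of the longitude and latitude ranges (Python; the geohash standard also defines a base32 string form, omitted here):

```python
def geohash_bits(lat, lng, bits=64):
    """Encode a lat/lng point as an integer geohash.

    Even-numbered bits refine longitude, odd-numbered bits latitude;
    each bit records which half of the current range the point lies in.
    """
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    h = 0
    for i in range(bits):
        if i % 2 == 0:  # longitude bit
            mid = (lng_lo + lng_hi) / 2
            if lng >= mid:
                h = (h << 1) | 1
                lng_lo = mid
            else:
                h <<= 1
                lng_hi = mid
        else:  # latitude bit
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                h = (h << 1) | 1
                lat_lo = mid
            else:
                h <<= 1
                lat_hi = mid
    return h
```

Note the prefix property falls out directly: the first N bits of a longer hash are exactly the N-bit hash of the same point.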
Our index consists of a sorted array of tuples containing (geohash(node.latlng), node.id), where the geohashes are stored to a full 64 bits of precision. We can use this array to efficiently look up all records of nodes that fall into the rectangle for a geohash of any precision. Since the array is sorted by geohash, we just need to find the indices of the first and last records that begin with the geohash prefix of interest. These indices can be efficiently determined using a binary-search-like procedure. The subarray between these two indices contains all of the records of interest.
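Sketched in Python, the prefix lookup amounts to two binary searches over the sorted array (the hash width is parameterized here for illustration; names are assumptions):

```python
from bisect import bisect_left

def prefix_range(index, prefix, prefix_bits, total_bits=64):
    """Return [lo, hi) bounds of records whose geohash starts with `prefix`.

    `index` is a list of (geohash, node_id) tuples sorted by geohash;
    `prefix` is a prefix_bits-wide integer geohash prefix.
    """
    shift = total_bits - prefix_bits
    lo_key = prefix << shift         # smallest full hash with this prefix
    hi_key = (prefix + 1) << shift   # first full hash past the prefix box
    lo = bisect_left(index, (lo_key,))
    hi = bisect_left(index, (hi_key,))
    return lo, hi
```

Everything in `index[lo:hi]` lies inside the rectangle that the prefix represents.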
Nearest neighbor queries can be executed as follows: a procedure converts the input coordinate into a set of geohash prefixes that cover a circle of some small radius (40 meters) about the query point. (There is a lot of complexity in this procedure; check out this presentation for a good overview of a similar implementation.) The index records for each prefix are fetched. Geohash also happens to be reversible– you can reconstruct the original lat/lng point to some accuracy. We can thus order the set of returned nodes by exact distance to our query point. If no points are returned on the initial query, the search radius is exponentially increased, up to a limit of 5 km, until a point is found. The radius limit prevents the query from returning too many points at once.
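The expanding-radius loop can be sketched as follows, with the geohash circle-cover step abstracted behind a `circle_query` callback (a hypothetical helper assumed to return candidate nodes with their exact distances):

```python
def nearest_node(query_pt, circle_query,
                 start_radius_m=40.0, max_radius_m=5000.0):
    """Expanding-radius nearest-neighbor search.

    `circle_query(pt, radius_m)` is assumed to perform the geohash
    cover-and-fetch step, returning [(node_id, exact_dist_m), ...]
    for nodes within the circle. The radius doubles until something
    is found or the 5 km cap is reached.
    """
    r = start_radius_m
    while r <= max_radius_m:
        hits = circle_query(query_pt, r)
        if hits:
            return min(hits, key=lambda h: h[1])  # closest candidate
        r *= 2.0
    return None  # nothing routable near this point
```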
Since the points inserted into the index are from real road geometry, there is a nice limit on the density of points– this means that there isn’t a worst case where millions of points need to be read from the index when doing a circle query. In practice, query times are typically well under 1 millisecond. This isn’t fast compared to a kd-tree, but it is small compared to the time spent in graph traversal and route construction. Each index record is only 8 bytes for the geohash and 4 bytes for the node reference.
It is convenient to think of a route as a list of edges that yield a path from one node to another. However, this simple model is not always sufficient. Some of the longer edges are over 100 miles. With such long edges, it isn’t ideal to force the user to start and end routes only at nodes. Routes might even begin and end on the same edge. The geospatial index for RouteMaster thus contains additional entries corresponding to points spaced out along long edges. These index entries refer to an edge_id rather than a node_id. They are added at points spaced approximately every 50 meters for all edges longer than 50 meters.
However, it isn’t enough just to know that the user wants to start their route in the middle of an edge. The ideal route might proceed in either direction along the starting edge. Since an edge is not a node, the search algorithm can’t handle this case.
To solve this problem, we add a ‘virtual’ node to the graph. The virtual node has two virtual edges that connect to the real nodes at either end of the underlying edge where the route begins. These nodes and edges are virtual in the sense that they are only part of the graph within the scope of a single routing request, and thus don’t require any persistent memory. The search algorithm is free to consider routes that proceed in either direction by routing over the virtual edges. In the tricky corner case where the route begins and ends on the same edge, the routing problem is normally trivial and can be handled separately.
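The request-scoped split can be sketched in a few lines (Python; the tuple encoding and names are illustrative, not RouteMaster's representation):

```python
def virtual_edges(edge, split_dist_m):
    """Split an edge (u, v, length_m) at split_dist_m from u.

    Returns two request-scoped virtual edges from a fresh virtual
    node to the edge's real endpoints, so the search may proceed in
    either direction along the original edge.
    """
    u, v, length_m = edge
    vnode = "virtual_start"  # exists only for this routing request
    return [(vnode, u, split_dist_m),
            (vnode, v, length_m - split_dist_m)]
```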
It isn’t always possible to get from point A to point B. If no path exists, a naive search algorithm must traverse every reachable edge before giving up. RouteMaster is aware of the connected components of the graph and can efficiently reason about connectedness. All routable edges are inserted into a disjoint-set data structure. This data structure generates a labeling of nodes by the partition to which they belong. A path exists between two nodes if and only if they are in the same partition. Using this property and the partition labeling, RouteMaster can decide in constant time whether a route exists.
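The connectivity check can be sketched with a standard disjoint-set (union-find) structure; in RouteMaster the partition labels are precomputed offline, but the idea looks like this (Python, illustrative):

```python
class DisjointSet:
    def __init__(self, num_nodes):
        self.parent = list(range(num_nodes))

    def find(self, x):
        # Path halving keeps trees shallow across repeated queries.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

def route_exists(ds, origin, goal):
    # Two nodes are connected iff they share a partition label.
    return ds.find(origin) == ds.find(goal)
```

Insert every routable edge with `union`, and reachability becomes a cheap label comparison instead of an exhaustive traversal.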
Furthermore, RouteMaster attempts to pick roads that are connected to each other. If the nearest nodes to the route start/end points belong to different partitions, more neighbor nodes can be examined until a pair of nodes with matching partitions is found. There are hundreds of thousands of disconnected single edges which are avoided by this method. Route waypoints will often magically snap to connected nodes, avoiding isolated networks.
A* Search and Bounded Relaxation
RouteMaster uses the A* algorithm to find routes. To greatly summarize, A* is a heuristic-based algorithm that efficiently finds the shortest path between a single pair of nodes. ‘Open’ nodes on the frontier of the search are explored in order of priority: the known current cost plus a heuristic estimate of the remaining cost. The known cost is the (optimal) path cost from the origin to the current node, and the heuristic is a function that estimates the remaining cost from the current node to the goal node. It can be shown that A* finds the optimal path as long as the heuristic function never overestimates the remaining cost. For road routing, the standard heuristic function is distance as the crow flies between the current node and the goal node. This heuristic never overestimates because no route between two points is shorter than a straight line. In most cases the heuristic leads to fewer explored nodes (compared to Dijkstra’s algorithm).
A useful speed/correctness trade-off technique (known as Bounded Relaxation) is to multiply the heuristic function by a relaxation factor greater than one. This makes the search greedier, since the heuristic dominates the ranking. The algorithm becomes biased towards exploring nodes closer to the straight line between the origin and the goal. Correctness is sacrificed: the first path found might not be the shortest. However, it can be shown that the cost of the path found is never worse than the relaxation factor times the cost of the optimal path.
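The two ideas above fit together in a short weighted-A* sketch (Python, a simplification of the algorithm described rather than RouteMaster's Scala code; it returns the path cost and omits path reconstruction):

```python
import heapq
import math

def a_star(neighbors, heuristic, start, goal, relaxation=1.0):
    """A* over a graph given by neighbors(node) -> [(next_node, cost)].

    With relaxation == 1.0 and a heuristic that never overestimates,
    the result is optimal; with relaxation > 1.0 the search is
    greedier, and the returned cost is at most relaxation * optimal.
    """
    # Each heap entry: (priority, known cost g, node).
    open_heap = [(relaxation * heuristic(start, goal), 0.0, start)]
    best_g = {start: 0.0}
    while open_heap:
        _, g, node = heapq.heappop(open_heap)
        if node == goal:
            return g
        if g > best_g.get(node, math.inf):
            continue  # stale entry; a cheaper path to node was found
        for nxt, cost in neighbors(node):
            g2 = g + cost
            if g2 < best_g.get(nxt, math.inf):
                best_g[nxt] = g2
                priority = g2 + relaxation * heuristic(nxt, goal)
                heapq.heappush(open_heap, (priority, g2, nxt))
    return None  # goal unreachable from start
```

With the crow-flies heuristic scaled by ~1.2, nodes near the straight line to the goal are popped first, which is exactly the bias described above.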
Through experimentation I found extremely promising results from this trade-off. In some ideal cases, the number of nodes explored during a search was reduced by a factor of 20, with a negligible increase in path length (using a relaxation factor of ~1.2). This trade-off doesn’t work well when the shortest path is much longer than the distance as the crow flies. A common example is routing around a convex body of water.
RouteMaster is an efficient solution to a specialized problem. All data structures, including the geospatial index, graph representation, and node and edge properties, use less than 16 gigabytes for the entire world (140 million edges). Searches can traverse around 1 million edges per second. Routes under 50 miles can typically be found in a few hundred milliseconds. All data structures are immutable and are stored off the JVM heap in memory-mapped files. Server startup time is nearly instant, and we don’t experience the ordeals of being bound by garbage collection. The server is implemented in Scala with a Finagle/Thrift interface. ZooKeeper is used for service discovery and load balancing.
I hope you enjoy building routes and exploring the world with RouteMaster! Please leave any questions or feedback in the comments below.