When running a production database, one needs to carefully consider all the data access patterns. The usual trap people fall into is optimizing for one data access pattern at the cost of all others. This causes production systems to fail spectacularly or lose millions of dollar as the system becomes slower over time.
In this blogpost I will be describing the problems and solutions we’ve come up with to solve some of our data access patterns at Battlefy.
Battlefy data access patterns
Our two main data access patterns are
- Critical sparse random writes for registration, check-in and score reporting
- Coordinated stampede of public reads for tournament information and stats
The common feature that spans both these access patterns is after a score is reported for a single game, the aggregate stats are updated. For example, in Apex Legends the number of kills for a game is recorded for a particular team. The tournament stats page could show several different aggregations by team.
- Total number of kills
- Average number of kills
Imagine the aggregate stats for all teams are shown in a table and offered the ability to sort the table by total or average number of kills. The total number of teams is long and we would need to implement pagination.
For our example, here are the domain relationships. The arrow with labels are the foreign keys.
How would you update the aggregate stats after a score report? Let’s walk through a few attempts.
Attempt #1: Buy it, use it, cache it
At Battlefy our primary database is MongoDB. The naïve approach is to optimize the collections for write, calculate the stats on read and put a cache in-front of it.
The following is condensed code for illustrative purposes only. We do not write production code like this at Battlefy.
Given this approach we can formulate the total database load for both
POST /apex-legends/:gameID/report-score and
Let’s say for the average Apex Legends tournament it has 640 teams in a single elimination free for all, which means a total of 32 games and we set the cache timeout to be 5 seconds. There are far more viewers than score reports and let’s use a conservative
1000:1 read to write factor. Then the total load would be
32 + 1,000 * (32 + 640) / 5 = 134,432.
This assumes the collections have proper indices in place or else the load would be much higher.
Notice how in this analysis it does not take the
page into account? This means we are loading all the
teams in order to show a single page for the team stats table. The real cost of the load is hidden by the cache. In fact the cache gets in the way of providing the best user experience. Ideally, the aggregate stats would update sooner than 5 seconds upon score report.
Can we do better?
Attempt #2: Aggregate as team stats
Why not flip the problem upside down? Instead of calculating the aggregate stats every single time, let’s calculate it once upon score report and store it in the teams collection, then we can form a MongoDB query that implements pagination properly.
This approach makes the score report much heavier, but the pagination is far cheaper. The formula for this approach is
Using the same numbers as before in the first approach and a teams per page of 10, we have a load of
32 * (32 + 640 + 640) + 1,000 * 10 = 51,984, or in other words
0.38x of the first approach! Performance was improved by shifting the work towards the infrequent operation (score report) to make the frequent operation (load page) cheaper.
Please do keep in mind these numbers are all made up to make my argument, but this shows how one can compare different approaches. In your own situation, you will be able to plug in real numbers.
While we have improved the database load, we’re starting to paint ourselves into a corner. Each Apex Legends team is composed of 3 players. What if we wanted to include the players stats, ladder rank, Twitch URL in the stats table as additional columns? Do we keep adding more “stats” fields to the
teams collection in order to maintain the paginationability?
Also with this pattern
POST /apex-legends/:gameID/report-score just keeps getting longer as we calculate more aggregations. It would already double in size if we were to implement player stats table.
We’re going to need a better solution.
Solution: You get a view, and you a get view; EVERYBODY GETS A VIEW!
Instead of having the
teams collection do double duty to maintain both the ease of update and paginationability, we simply move the stats into its own collection
teamsStats. This new collection is a Materialized view.
Holup, but don’t materialized views copy information? Doesn’t this mean we would have team name in multiple places? Yep, and that is OK. Having the team name in multiple places is only a problem if you don’t know which one to update when the team name changes.
The key idea here is to have a single source of truth, but allow that source to be freely copied into materialized views. The code to update the team name should only every worry about the single source of truth. We need something else to maintain the materialized views.
Let me show you the code and this will all make more sense.
Views concept allows us to decouple
teamsStats. We can now add as many views as we want and perfectly paginate them all. We never have to worry about messing up source of truth with these extra views.
Views also introduces a useful seam and which allows us to add many enhancements:
- add an operation hook to regenerate views on-demand
- debounce view generation requests
- add view analytics, which could be used to see how certain features are performing
Load pagedatabase query to use read preference secondary to reduce demand on primary (as opposed to adding caching
Load pageas that would cause rows to be confusingly duplicated across pages if they happen to shift to the next page between cache invalidations)
- off load view generation into its own microservice that can scale independently
- use change streams to not require any code in
Score reportto signal document updates to
- put a view generation request queue in-front of view generators to allow for view generators to crash for free retries and increase operation visibility by being able to inspect the queue stats
- pre-deploy view generators to enable inspection of views with real production data before it is actually used by the frontend
- save views to a real-time database to have views update in the frontend without user interaction (however, a frontend with smart caching with
setIntervalwill go a long way before this is ever needed)
These enhancements were not possible with attempt #2, which welded everything together too closely.
Note the view service implementation is just the shortest code to show how one could work. Production would require more code in order to handle errors and admin entry points for operations. There are many different ways to implement a view service.
The load is identical to attempt #2 but we have managed to improve the extensibility of the system.
Wait, its all design? Always has been.
I’ve been a little cheeky here as what you have just read is what we internally call a “technical design document”. For every non-trivial feature, we document the problem, constraints and a few possible solutions. We debate the tradeoffs of the various solutions and pick the best one.
The technical design document is invaluable as it allows the team to review your solution before actually implementing it. Numerous problems are caught at this stage and the best solution can only come about with enough eyeballs on it.
We also use technical design documents for training, mentorship and task delegation. We define what a feature is suppose to do separately from how we are going to implement it. The technical design document describes the latter. Given the scope, business rules and a technical design document, we expect any Junior Developer to be able to implement a feature.
Do you want to learn how to write the best technical design documents? You’re in luck, Battlefy is hiring.