How not to blow up the production database

By Ronald Chen

Ronald Chen

When running a production database, one needs to carefully consider all the data access patterns. The usual trap people fall into is optimizing for one data access pattern at the cost of all others. This causes production systems to fail spectacularly or lose millions of dollar as the system becomes slower over time.

In this blogpost I will be describing the problems and solutions we’ve come up with to solve some of our data access patterns at Battlefy.

Battlefy data access patterns

Our two main data access patterns are

  1. Critical sparse random writes for registration, check-in and score reporting

The common feature that spans both these access patterns is after a score is reported for a single game, the aggregate stats are updated. For example, in Apex Legends the number of kills for a game is recorded for a particular team. The tournament stats page could show several different aggregations by team.

  • Total number of kills

Imagine the aggregate stats for all teams are shown in a table and offered the ability to sort the table by total or average number of kills. The total number of teams is long and we would need to implement pagination.

For our example, here are the domain relationships. The arrow with labels are the foreign keys.

How would you update the aggregate stats after a score report? Let’s walk through a few attempts.

Attempt #1: Buy it, use it, cache it

At Battlefy our primary database is MongoDB. The naïve approach is to optimize the collections for write, calculate the stats on read and put a cache in-front of it.

The following is condensed code for illustrative purposes only. We do not write production code like this at Battlefy.

Load analysis

Given this approach we can formulate the total database load for both POST /apex-legends/:gameID/report-score and GET /tournament/:tournamentID/team-stats.

Let’s say for the average Apex Legends tournament it has 640 teams in a single elimination free for all, which means a total of 32 games and we set the cache timeout to be 5 seconds. There are far more viewers than score reports and let’s use a conservative 1000:1 read to write factor. Then the total load would be 32 + 1,000 * (32 + 640) / 5 = 134,432.

This assumes the collections have proper indices in place or else the load would be much higher.

Notice how in this analysis it does not take the page into account? This means we are loading all the games and teams in order to show a single page for the team stats table. The real cost of the load is hidden by the cache. In fact the cache gets in the way of providing the best user experience. Ideally, the aggregate stats would update sooner than 5 seconds upon score report.

Can we do better?

Attempt #2: Aggregate as team stats

Why not flip the problem upside down? Instead of calculating the aggregate stats every single time, let’s calculate it once upon score report and store it in the teams collection, then we can form a MongoDB query that implements pagination properly.

Load analysis

This approach makes the score report much heavier, but the pagination is far cheaper. The formula for this approach is

Using the same numbers as before in the first approach and a teams per page of 10, we have a load of 32 * (32 + 640 + 640) + 1,000 * 10 = 51,984, or in other words 0.38x of the first approach! Performance was improved by shifting the work towards the infrequent operation (score report) to make the frequent operation (load page) cheaper.

Please do keep in mind these numbers are all made up to make my argument, but this shows how one can compare different approaches. In your own situation, you will be able to plug in real numbers.

While we have improved the database load, we’re starting to paint ourselves into a corner. Each Apex Legends team is composed of 3 players. What if we wanted to include the players stats, ladder rank, Twitch URL in the stats table as additional columns? Do we keep adding more “stats” fields to the teams collection in order to maintain the paginationability?

Also with this pattern POST /apex-legends/:gameID/report-score just keeps getting longer as we calculate more aggregations. It would already double in size if we were to implement player stats table.

We’re going to need a better solution.

Solution: You get a view, and you a get view; EVERYBODY GETS A VIEW!

Instead of having the teams collection do double duty to maintain both the ease of update and paginationability, we simply move the stats into its own collection teamsStats. This new collection is a Materialized view.

Holup, but don’t materialized views copy information? Doesn’t this mean we would have team name in multiple places? Yep, and that is OK. Having the team name in multiple places is only a problem if you don’t know which one to update when the team name changes.

The key idea here is to have a single source of truth, but allow that source to be freely copied into materialized views. The code to update the team name should only every worry about the single source of truth. We need something else to maintain the materialized views.

Let me show you the code and this will all make more sense.

Core algorithm

This Views concept allows us to decouple teams from teamsStats. We can now add as many views as we want and perfectly paginate them all. We never have to worry about messing up source of truth with these extra views.

Views also introduces a useful seam and which allows us to add many enhancements:

  • add an operation hook to regenerate views on-demand

These enhancements were not possible with attempt #2, which welded everything together too closely.

Note the view service implementation is just the shortest code to show how one could work. Production would require more code in order to handle errors and admin entry points for operations. There are many different ways to implement a view service.

Load analysis

The load is identical to attempt #2 but we have managed to improve the extensibility of the system.

Wait, its all design? Always has been.

I’ve been a little cheeky here as what you have just read is what we internally call a “technical design document”. For every non-trivial feature, we document the problem, constraints and a few possible solutions. We debate the tradeoffs of the various solutions and pick the best one.

The technical design document is invaluable as it allows the team to review your solution before actually implementing it. Numerous problems are caught at this stage and the best solution can only come about with enough eyeballs on it.

We also use technical design documents for training, mentorship and task delegation. We define what a feature is suppose to do separately from how we are going to implement it. The technical design document describes the latter. Given the scope, business rules and a technical design document, we expect any Junior Developer to be able to implement a feature.

Do you want to learn how to write the best technical design documents? You’re in luck, Battlefy is hiring.