MySQL – Database choice for an efficient read-heavy application with millions to billions of data points

database-design, mongodb, mysql

I am asking for suggestions on which DB systems would best fit my use case.
I am developing an analytical tool that queries millions of documents. The documents are loosely related, so they can be represented well in a SQL database. They can also be represented in NoSQL using embedded documents. It is important that queries are processed very efficiently, within a couple of seconds if possible.

The total size of the database is expected to be 500GB+, most likely between 1-2TB.
There could be millions to billions of rows depending on the structure of the database (the number of columns in SQL, or how deeply embedded the documents would be in Mongo).
The databases I've looked at and modeled are MongoDB and MySQL or MariaDB. MongoDB would include lots of duplicated information (game characteristics) in each player document, whereas MySQL or MariaDB could use a game table plus a player_game table describing each event.

The data represents player actions across different stages in a single game. There are up to 4 different stages. There are multiple actions, and both the actions and the opportunities to take them need to be stored. On top of that, each action can have a single-word description.

Example Query #1:
"how often does a player take action A given that he had the opportunity to take action A in a game with characteristics X, Y and Z".

The query could look like:

SELECT SUM(action_1)/SUM(action_1_opp) FROM player_game 
WHERE action_1_opp = 1 

With game characteristics added:

SELECT SUM(action_1)/SUM(action_1_opp) FROM player_game p
INNER JOIN game g on p.game_id = g.game_id
WHERE p.action_1_opp = 1 AND g.player_count = 5
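For reference, a minimal runnable sketch of this query follows. It uses Python's built-in sqlite3 module with an in-memory database rather than MySQL, purely so it can be run without a server. The column types and sample rows are made up, and the `1.0 *` factor is only there because SQLite would otherwise do integer division (MySQL's `/` operator does not).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE game (game_id INTEGER PRIMARY KEY, player_count INTEGER);
CREATE TABLE player_game (
    game_id INTEGER REFERENCES game(game_id),
    player_id INTEGER,
    action_1 INTEGER,      -- 1 if the player took action 1
    action_1_opp INTEGER   -- 1 if the player had the opportunity to take it
);
-- Made-up sample data.
INSERT INTO game VALUES (1, 5), (2, 5), (3, 6);
INSERT INTO player_game VALUES
    (1, 10, 1, 1), (1, 11, 0, 1),
    (2, 10, 1, 1), (2, 12, 0, 0),
    (3, 10, 1, 1);
""")

def action_1_rate(player_count):
    """Share of opportunities in which action 1 was taken, for a given game size."""
    row = conn.execute("""
        SELECT 1.0 * SUM(p.action_1) / SUM(p.action_1_opp)
        FROM player_game p
        INNER JOIN game g ON p.game_id = g.game_id
        WHERE p.action_1_opp = 1 AND g.player_count = ?
    """, (player_count,)).fetchone()
    return row[0]

rate = action_1_rate(5)
```

With these rows, the 5-player games contain three opportunities, two of which were taken.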

Example Query #2:
"how often does a player take action B given that he also took action A with description some_description"
There are millions of those game documents/rows, and multiple players for each game, so the player_game table could grow to billions of rows.

Query:
If wide columns are not used, this will probably require a self join, because we need to find multiple actions from the same game and make sure action A happened. If we need to filter for multiple actions that happened in the same game (a likely query), it might require multiple self joins.

If wide columns are used, each action has its own column so we only check whether it's set to 1.
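To make the comparison concrete, here is a minimal sketch of Query #2 in both layouts. It uses Python's built-in sqlite3 module only so that it runs without a server. The table names player_game_action and player_game_wide, their columns, and the sample rows are all hypothetical, and the narrow layout assumes one row per game, player and action.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Narrow layout (hypothetical): one row per game, player and action.
CREATE TABLE player_game_action (
    game_id INTEGER, player_id INTEGER, action TEXT,
    taken INTEGER, description TEXT
);
INSERT INTO player_game_action VALUES
    (1, 10, 'A', 1, 'some_description'), (1, 10, 'B', 1, NULL),
    (1, 11, 'A', 1, 'other'),            (1, 11, 'B', 1, NULL),
    (2, 10, 'A', 1, 'some_description'), (2, 10, 'B', 0, NULL),
    (2, 12, 'A', 0, NULL),               (2, 12, 'B', 1, NULL);

-- Wide layout (hypothetical): one row per game and player, one column per action.
CREATE TABLE player_game_wide (
    game_id INTEGER, player_id INTEGER,
    action_a INTEGER, action_a_desc TEXT, action_b INTEGER
);
INSERT INTO player_game_wide VALUES
    (1, 10, 1, 'some_description', 1), (1, 11, 1, 'other', 1),
    (2, 10, 1, 'some_description', 0), (2, 12, 0, NULL, 1);
""")

# Narrow layout: a self join pairs each qualifying A row with the B row
# of the same player in the same game.
narrow_rate = conn.execute("""
    SELECT 1.0 * SUM(b.taken) / COUNT(*)
    FROM player_game_action a
    JOIN player_game_action b
      ON b.game_id = a.game_id AND b.player_id = a.player_id
    WHERE a.action = 'A' AND a.taken = 1 AND a.description = ?
      AND b.action = 'B'
""", ("some_description",)).fetchone()[0]

# Wide layout: the same question is a plain filter on one row.
wide_rate = conn.execute("""
    SELECT 1.0 * SUM(action_b) / COUNT(*)
    FROM player_game_wide
    WHERE action_a = 1 AND action_a_desc = ?
""", ("some_description",)).fetchone()[0]
```

Both queries return the same ratio on this sample. Each additional action in the filter adds another self join in the narrow layout, but only another predicate in the wide one.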

Example Query #3:
"how often does a player who has played the game for x amount of time take action A?"

SELECT SUM(A)/SUM(A_opp) FROM player_game 
INNER JOIN player on player_game.player_id=player.player_id 
WHERE player.time_played > 10000 

Example Query #4:
"how did a player P1 respond to action X played by player P2 with statistics XYZ?"

This would require either storing who took each action (adding to the number of columns) or a separate table tracking who took each action. We would then need one join to pull player statistics and another to see who took each action (if a separate table is used to store that).
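As a rough illustration of the separate-table variant, the sketch below runs against an in-memory SQLite database through Python's sqlite3 module. Everything in it is an assumption made for the example: the action_event table, its seq column (used here to define "the response" as the next event in the same game), the action labels, and the time_played threshold standing in for "statistics XYZ".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE player (player_id INTEGER PRIMARY KEY, time_played INTEGER);
-- Hypothetical event table: who took which action, in what order.
CREATE TABLE action_event (
    game_id INTEGER, seq INTEGER, player_id INTEGER, action TEXT,
    PRIMARY KEY (game_id, seq)
);
INSERT INTO player VALUES (1, 20000), (2, 500), (3, 15000);
INSERT INTO action_event VALUES
    (1, 1, 1, 'X'), (1, 2, 2, 'Y'),
    (2, 1, 3, 'X'), (2, 2, 2, 'Z'),
    (3, 1, 2, 'X'), (3, 2, 1, 'Y');
""")

def responses_to_x(responder_id, min_time_played):
    """Count how responder_id answered action X from experienced opponents."""
    return conn.execute("""
        SELECT r.action, COUNT(*)
        FROM action_event x
        JOIN player p2                      -- join 1: statistics of the X player
          ON p2.player_id = x.player_id
        JOIN action_event r                 -- join 2: the next event in the game
          ON r.game_id = x.game_id AND r.seq = x.seq + 1
        WHERE x.action = 'X'
          AND p2.time_played > ?
          AND r.player_id = ?
        GROUP BY r.action
        ORDER BY r.action
    """, (min_time_played, responder_id)).fetchall()

responses = responses_to_x(responder_id=2, min_time_played=10000)
```

On the sample rows, player 2 answered X once with Y and once with Z. The X taken in game 3 is ignored because it was played by player 2, whose time_played is below the threshold.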

Question 1:
Can MySQL or MariaDB handle that type of load, given that a JOIN with the game table or the player table would sometimes be necessary? (Those tables would be much smaller than the player_game table.)

Question 2:
If SQL is the answer, is it better to have around 100 columns in each player_game row, one per stage and action, such as:

game_id | player | stage_1_action_1 | stage_1_action_2

Stages/actions would have more descriptive names; these are just examples to give you an idea of the schema.

Or is it better to have a separate column for the stage, cutting the number of columns by a factor of 3 but doubling or tripling the number of rows? Such as:

game_id | player | stage | action_1
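A small sketch of what querying the second layout could look like, again using Python's sqlite3 module in place of MySQL. The table name player_game_stage, the action_1_opp column and the sample rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical narrow layout: one row per game, player and stage.
CREATE TABLE player_game_stage (
    game_id INTEGER, player_id INTEGER, stage INTEGER,
    action_1 INTEGER, action_1_opp INTEGER
);
INSERT INTO player_game_stage VALUES
    (1, 10, 1, 1, 1), (1, 10, 2, 0, 1),
    (1, 11, 1, 0, 1), (1, 11, 2, 0, 0),
    (2, 10, 1, 1, 1), (2, 10, 2, 1, 1);
""")

# With the stage stored as data, one query covers every stage via GROUP BY
# instead of naming stage_1_action_1, stage_2_action_1, ... explicitly.
rates_by_stage = conn.execute("""
    SELECT stage, 1.0 * SUM(action_1) / SUM(action_1_opp)
    FROM player_game_stage
    WHERE action_1_opp = 1
    GROUP BY stage
    ORDER BY stage
""").fetchall()
```

The trade-off is the one described above: fewer columns and simpler per-stage queries, at the cost of more rows.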

Question 3:
The architecture looks very similar to a data warehouse star schema, but since it's likely to be only up to 1-2TB in size, would columnar storage be worth trying out? MariaDB comes with a columnar storage engine, but I am unsure whether it is open-source (which would be much preferable).

Thank you so much for taking the time to read through the post.

If there are systems other than MongoDB and MariaDB worth looking at, I'd appreciate the input. Cassandra is a wide-column store that could also be worth considering.

Best Answer

For question 1, MariaDB or MySQL should be able to handle what you describe with no problem. The speed of your queries will depend in part on indexing your tables properly.

For question 2, the second choice is without a doubt the better one: one row for every player, stage and action. Again, proper indexing is going to be key to reasonable query times.
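As a hedged illustration of what "proper indexing" might mean here, the sketch below builds a one-row-per-player-stage-action table with a composite index and checks that the planner picks it. It runs on SQLite through Python's sqlite3 module, not on MySQL or MariaDB. The table, column and index names are invented, and the right index for a real workload depends on the actual queries (in MySQL, EXPLAIN plays the role that EXPLAIN QUERY PLAN plays below).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical layout: one row per game, player, stage and action.
CREATE TABLE player_stage_action (
    game_id INTEGER, player_id INTEGER, stage INTEGER,
    action TEXT,
    taken INTEGER,   -- 1 if the action was taken
    opp INTEGER      -- 1 if there was an opportunity to take it
);
-- Composite index: equality-filtered columns first, aggregated column last,
-- so the query below can be answered from the index alone.
CREATE INDEX idx_psa_action_opp ON player_stage_action (action, opp, taken);
INSERT INTO player_stage_action VALUES
    (1, 10, 1, 'A', 1, 1), (1, 11, 1, 'A', 0, 1),
    (2, 10, 2, 'A', 1, 1), (2, 12, 1, 'A', 0, 0),
    (1, 10, 1, 'B', 1, 1);
""")

QUERY = """
    SELECT 1.0 * SUM(taken) / COUNT(*)
    FROM player_stage_action
    WHERE action = ? AND opp = 1
"""

rate = conn.execute(QUERY, ("A",)).fetchone()[0]

# Ask SQLite how it would run the query; the last column of each plan row
# is a human-readable description that names any index being used.
plan = conn.execute("EXPLAIN QUERY PLAN " + QUERY, ("A",)).fetchall()
uses_index = any("idx_psa_action_opp" in row[3] for row in plan)
```

Checking the plan like this before loading billions of rows is a cheap way to confirm that a query will not fall back to a full table scan.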