MySQL – Precautions when designing a database schema if tables are going to have billions of rows

database-design, database-recommendation, MySQL, performance, query-performance

I am about to start work on a telephone-dictionary type of project. It is confirmed that there are going to be billions of records in the dictionary table, and each entry in this dictionary table may have a further inventory table that references the dictionary table. I have not worked with such a huge database before.

InnoDB is good for maintaining a relational database. There are category and subcategory references, so I am going to use InnoDB. There will come a situation where I need to show a total count based on category or subcategory, or even based on state and city, and so on; it may be any combination.

I am familiar with creating indexes on the most-searched columns. I have also heard that table partitioning helps to speed up queries.
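For context, this is roughly what I mean; a range-partitioned InnoDB table with a composite index on the searched columns. All table names, column names and partition boundaries here are my own illustration, not a real schema:

```sql
-- Hypothetical dictionary table, range-partitioned on the primary key.
-- Names and partition boundaries are illustrative only.
CREATE TABLE dictionary (
    id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    category_id INT NOT NULL,
    state       CHAR(2) NOT NULL,
    city        VARCHAR(64) NOT NULL,
    phone       VARCHAR(20) NOT NULL,
    PRIMARY KEY (id),
    -- Composite index covering the most common filter combination.
    KEY idx_cat_state_city (category_id, state, city)
) ENGINE = InnoDB
PARTITION BY RANGE (id) (
    PARTITION p0   VALUES LESS THAN (100000000),
    PARTITION p1   VALUES LESS THAN (200000000),
    PARTITION pmax VALUES LESS THAN MAXVALUE
);
```

Note that in MySQL every unique key on a partitioned table (including the primary key) must contain the partitioning column, which is why the sketch partitions on `id`.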

My question is: what points should I consider at this early stage, when creating tables that are going to have billions of rows, so that later on, when the tables grow larger, I can keep performance high for both SELECT queries and DML queries (insert, update)?

Guidance will help me a lot.

Best Answer

Start by understanding that the problems you face have nothing to do with the database technology you use. They are caused by physics, and physics does not care about Oracle, Microsoft or open source. It is the same everywhere.

Generally, depending on the queries, you may want a HUGE server (albeit "billions of rows" sounds like too much for a phone dictionary), or a cluster of servers. Not "I got 8 GB of memory". I have a SQL Server here with 48 GB of memory, and guess what: that is small, not to say tiny, when talking big data.

Also, ad hoc queries MAY not work. Oracle Exadata is great hardware for making ad hoc queries work by pure power, but even there, there are limits.

There will come a situation where I need to show total count based on category or subcategory, or even based on state and city.

There is a concept called a data warehouse, and it has a totally different schema than an OLTP (transactional) database. I suggest you read up on database concepts and the difference between OLAP and OLTP, and also about normalization and, on the other side, star schemata (used for reporting).
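To make the contrast concrete, a minimal star-schema sketch looks like this: one central fact table keyed by surrogate IDs into small dimension tables, so reporting queries join and group over narrow dimensions instead of scanning the billion-row OLTP table. All names here are invented for illustration:

```sql
-- Dimension tables: small, descriptive lookup tables.
CREATE TABLE dim_location (
    location_id INT PRIMARY KEY,
    state       CHAR(2),
    city        VARCHAR(64)
);

CREATE TABLE dim_category (
    category_id INT PRIMARY KEY,
    category    VARCHAR(64),
    subcategory VARCHAR(64)
);

-- Fact table: one row per (location, category) combination,
-- holding the pre-aggregated measure.
CREATE TABLE fact_entries (
    location_id INT NOT NULL,
    category_id INT NOT NULL,
    entry_count BIGINT NOT NULL,
    KEY (location_id),
    KEY (category_id)
);

-- A reporting query then touches only small tables:
SELECT l.state, c.category, SUM(f.entry_count)
FROM fact_entries f
JOIN dim_location l ON l.location_id = f.location_id
JOIN dim_category c ON c.category_id = f.category_id
GROUP BY l.state, c.category;
```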

Generally you may run into a real-time problem here, so regularly updated precalculated tables may be in order. The numbers may be slightly off, but who cares if the count by city is 5 minutes stale (or whatever update interval you use)?
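As a sketch of that idea (table and column names are assumptions, matching nothing real): keep a small counts table and rebuild it periodically from the big table, from a MySQL event or a cron job, rather than running the aggregation on every page view.

```sql
-- Hypothetical precalculated counts, refreshed on a schedule.
CREATE TABLE dictionary_counts (
    state     CHAR(2) NOT NULL,
    city      VARCHAR(64) NOT NULL,
    total     BIGINT NOT NULL,
    refreshed TIMESTAMP DEFAULT CURRENT_TIMESTAMP
                        ON UPDATE CURRENT_TIMESTAMP,
    PRIMARY KEY (state, city)
);

-- Run every few minutes; REPLACE overwrites stale rows by primary key.
REPLACE INTO dictionary_counts (state, city, total)
SELECT state, city, COUNT(*)
FROM dictionary
GROUP BY state, city;
```

The heavy GROUP BY then runs once per interval instead of once per user, and the application reads `dictionary_counts` with a cheap primary-key lookup.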

A lot depends on the exact business requirements: on what is acceptable, how often certain queries happen, and whether they can be offloaded. A data mart / warehouse may be a good idea to isolate reporting / aggregating queries from OLTP queries and lessen the burden on the servers. Sometimes you cannot. But not everything can or should be done ad hoc, especially when multiple users hit the same (heavy) queries.

Generally, the guidance is "read some books or hire someone who knows how to work with databases". The point being: you don't even know enough yet to ask the right questions, and you focus on InnoDB for the totally wrong reason ("good for maintaining relational database" is like saying "Big Mac makes good food because their hamburgers have meat"): there is no real sense in the statement at all.