MySQL – Best-practice design when expecting a large amount of data in a table

database-design, MySQL

I have come across a way that kind of works for me, and was just wondering if it would have any side effects, or if I'm just thinking about this the wrong way.

I first had this idea when I was building a relatively simple location database. It would hold a table of 249 countries and a table of 143,000 cities with their longitudes and latitudes.

--------------------------
-         Country        -
--------------------------
-   id  -    name        -
--------------------------
-    1  -   England      -
-    2  -    Wales       -
--------------------------

-----------------------------------------------------------
-                City                                     -
-----------------------------------------------------------
- id -     name   -      lng    -    lat     - COUNTRY_id -
-----------------------------------------------------------
-  1 -  London    -  -0.127758  - 51.507351  -    1       -
-  2 - Canterbury -   1.078909  - 51.280233  -    1       -
-----------------------------------------------------------

Now, when I was using the Haversine equation to find the closest city given a longitude and latitude, it would only manage about 10 results per second. My assumption was that this was because it had to run the calculation against all 143,000 cities …
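For illustration, the sort of full-scan query I mean looks something like this (a sketch, not my exact SQL; 6371 is the Earth's radius in kilometres, and :lat / :lng are placeholders for the search point):

SELECT id, name,
       2 * 6371 * ASIN(SQRT(
           POW(SIN(RADIANS(lat - :lat) / 2), 2) +
           COS(RADIANS(:lat)) * COS(RADIANS(lat)) *
           POW(SIN(RADIANS(lng - :lng) / 2), 2)
       )) AS distance_km              -- Haversine great-circle distance
FROM City
ORDER BY distance_km
LIMIT 1;

Every row in City has to go through the trigonometry before the ORDER BY can pick the nearest one, which is why the table size dominates the run time.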

So, instead of going out and buying a supercomputer to do these calculations, I thought I could narrow down which cities it had to do the calculations on.

I did this by dividing the world into 2,448 grid squares and putting each square's cities into a table of their own, effectively ending up with 2,448 tables. I then use PHP to find which grid square the given longitude and latitude falls in, and query that table and its surrounding 'grid squares', or tables.

This resulted in a ten-fold speed increase, returning over 100 results per second.
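To give a rough idea of the lookup (I do this bit in PHP, but expressed as SQL it amounts to the same arithmetic; the 72 x 34 layout of roughly 5-degree cells below is just one way to get 2,448 squares, not necessarily my exact cell size):

-- cell_x runs 0..71 (5 degrees of longitude per cell),
-- cell_y runs 0..33 (about 5.3 degrees of latitude per cell);
-- a point exactly on lng = 180 or lat = 90 would need clamping to the last cell.
SELECT FLOOR((:lng + 180) / 5)                     AS cell_x,
       FLOOR((:lat + 90) * 34 / 180)               AS cell_y,
       FLOOR((:lng + 180) / 5)
         + 72 * FLOOR((:lat + 90) * 34 / 180)      AS grid_square;

The Haversine query then only has to run against that grid square's table and its surrounding tables instead of all 143,000 cities.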

I was wondering if the same concept could be used in, say, a user database, where the table might be split depending on the first two characters of each person's username. So, if you had 1,000,000 users (and they were only allowed a–z in their usernames), you could effectively spread them over 676 tables, averaging about 1,500 users per table, and so increase the speed at which a user could log on?

Ha … Notice the question mark at the end …

So, I'm expecting a lot of 'Nope … That's just wrong' … But I kinda want to know if my brain's just having a stupid week, or if someone has seen something along these lines.

Best Answer

Yes, the same concept could be used. What you have done is re-implement table partitioning, but in user space. Most industrial-strength RDBMSs have this built in. The provided functionality often includes additional abilities, such as efficiently adding and removing partitions at run time without application changes. By choosing to roll your own you miss out on these features. You also complicate some things, such as surrogate ID uniqueness checking, aggregate queries across your whole user community, and DRI (foreign keys) referencing the "user" table.
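For example, MySQL can declare a hash-style partition on the user name when the table is created and then route rows (and point lookups) to the right partition itself; the table and column names below are only illustrative:

CREATE TABLE users (
    username  VARCHAR(30)  NOT NULL,
    email     VARCHAR(255) NOT NULL,
    pw_hash   CHAR(60)     NOT NULL,
    PRIMARY KEY (username)      -- the partitioning column must appear in
                                -- every unique key, including the primary key
)
PARTITION BY KEY (username)
PARTITIONS 16;

A lookup such as SELECT ... WHERE username = 'bob' then touches only the one partition (partition pruning), with no table-picking logic in the application.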

Be aware that your sub-tables are very unlikely to be well balanced. There aren't many Mr. Aardvarks or Ms. Zymologys in the world, but there are a lot of Smiths and Joneses.

The reason key lookups are faster on smaller tables is that the indexes have fewer levels, assuming you have B-Tree indexes. The DBMS therefore has to read fewer pages to get from the index's root node to its leaf and data pages. The index on each of your 676 sub-tables is likely to be only one or two levels deep, so it incurs only one or two page reads to find a key's row. In contrast, a full B-Tree built on 1M rows may be three or four levels deep and require that many page reads per lookup. Built-in partitioning can give you similar benefits if you define your index as partitioned, too.
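As a rough illustration (the page and key sizes here are assumptions, not measurements): with a 16 KB index page and roughly 50 bytes per entry, a page holds about 300 keys. 1,500 rows then need around 5 leaf pages plus a root, i.e. a two-level index and two page reads per lookup; 1,000,000 rows need around 3,300 leaf pages, a dozen or so intermediate pages and a root, i.e. three levels and three page reads per lookup.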

This is a good reason to keep your index keys compact, if you have a choice. For example, if your user name is 30 characters long it will take 30 bytes to store, and you will fit a certain number of entries on each index page. If instead you calculate an integer hash (4 bytes) of the user name and index that, there will be roughly 30/4 ≈ 7 times as many entries per index page and the index will likely have fewer levels. (Of course, you will have to account for potential hash collisions.) Similarly, limiting the amount of free space reserved in the index for future inserts will help increase density.
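As a sketch of the hashing idea on MySQL 5.7+ (the column and index names are placeholders, and CRC32 is used only because it is built in; you still compare the real user name to weed out collisions):

ALTER TABLE users
    ADD COLUMN username_hash INT UNSIGNED
        AS (CRC32(username)) STORED,    -- 4-byte generated column
    ADD INDEX idx_username_hash (username_hash);

-- The 4-byte hash narrows the index search; the equality on username
-- filters out any CRC32 collisions.
SELECT *
FROM   users
WHERE  username_hash = CRC32('some_user')
  AND  username      = 'some_user';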