Is it a good approach to design tables according to the queries I’ll perform on them?

database-design performance

I'm watching this video and am pretty new to DBMSs.
The speaker explains that in a row-oriented DB, rows are read in blocks.
So, my understanding is that if I have rows with fewer fields, more rows can fit into a single block, and querying the table should take fewer I/O operations, resulting in better performance. Am I right?

Can I extract the rule that I shouldn't design tables according to the entity they represent but, instead, according to how frequently I'll read or update those fields?

For example:
table employers:

ID
Name (frequently used)
Badge number (frequently used)
Birth date (rarely used)
Birthplace (rarely used)
Should I split the table into 2?
tbl1: ID | Name | Badge number
tbl2: ID | Birth date | Birthplace

Best Answer

In most database management systems, data is stored as pages, not blocks. Pages are normally 4 or 8 KB, depending on the database and how it's been configured.

All else being equal, a smaller row size means better reuse of cached pages and fewer page reads on queries that require a large number of rows, so less I/O and faster read performance.
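To put rough numbers on that, here is a minimal back-of-the-envelope sketch. The 8 KB page size comes from above, but the row sizes (200 bytes for the full row, 80 bytes for the narrow one) and the row count are made-up assumptions, and the model ignores page headers, row headers, and fill factor, so treat the results as an order of magnitude only.

```python
PAGE_SIZE = 8 * 1024  # bytes; assumes an 8 KB page with no header overhead


def rows_per_page(row_bytes: int, page_bytes: int = PAGE_SIZE) -> int:
    """Whole rows that fit in one page under this simplified model."""
    return page_bytes // row_bytes


def pages_to_scan(row_count: int, row_bytes: int, page_bytes: int = PAGE_SIZE) -> int:
    """Pages a full scan of row_count rows has to read (ceiling division)."""
    per_page = rows_per_page(row_bytes, page_bytes)
    return -(-row_count // per_page)


# Hypothetical sizes: 200 bytes with all five columns,
# 80 bytes with only ID, Name and Badge number.
wide = pages_to_scan(1_000_000, 200)
narrow = pages_to_scan(1_000_000, 80)
print(f"wide rows:   {rows_per_page(200)} per page, {wide} pages to scan")
print(f"narrow rows: {rows_per_page(80)} per page, {narrow} pages to scan")
```

The gap only matters for queries that touch many rows; a single-row lookup reads one page either way.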

However

If you vertically partition the table (as you have in your example), there will be a slight increase in overall storage (roughly the primary key length times the number of rows, plus the second table's B-tree), and insert performance will be slightly slower, as you'll need to maintain a PK-FK relationship between the two tables.

Furthermore, if most of your queries are for single-record lookups, you're still going to be reading a single page. There's a greater chance that page will be cached, but reading 4 or 8 KB off a modern disk is really not an expensive operation.

Splitting the table would require 2 page reads (and navigating two B-trees) when you require BirthDate/BirthPlace. Again, not really a big deal on modern hardware.
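For illustration, this is roughly what the split from the question looks like, sketched with Python's built-in sqlite3 module as a stand-in for whatever DBMS you actually use. The table names employee_core and employee_birth and the sample row are hypothetical. Note that every new employee now needs two inserts, and getting the birth data back needs a join across both tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default

conn.executescript("""
    -- Frequently used columns
    CREATE TABLE employee_core (
        id           INTEGER PRIMARY KEY,
        name         TEXT NOT NULL,
        badge_number TEXT NOT NULL
    );
    -- Rarely used columns; the primary key is repeated and doubles as the FK
    CREATE TABLE employee_birth (
        id         INTEGER PRIMARY KEY REFERENCES employee_core(id),
        birth_date TEXT,
        birthplace TEXT
    );
""")

# One logical employee now costs two inserts.
conn.execute("INSERT INTO employee_core VALUES (1, 'Ada', 'B-1001')")
conn.execute("INSERT INTO employee_birth VALUES (1, '1990-05-17', 'Turin')")

# The common query touches only the narrow table.
frequent = conn.execute(
    "SELECT name, badge_number FROM employee_core WHERE id = 1"
).fetchone()

# The rare query has to join the two tables (two B-tree lookups).
full = conn.execute("""
    SELECT c.name, c.badge_number, b.birth_date, b.birthplace
    FROM employee_core AS c
    JOIN employee_birth AS b ON b.id = c.id
    WHERE c.id = 1
""").fetchone()

print(frequent)
print(full)
```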

The only time I would vertically partition a table would be in certain data warehouse situations, or if BirthDate/BirthPlace were nullable and infrequently populated.

Other Considerations

If badge number is relatively small in size (say, under 20-30 bytes), the best thing you can do to increase performance would be to drop the unneeded ID column and make your primary key BadgeNumber since:

You shouldn't have duplicates in that column
Most likely you will primarily look up on that column, so using BadgeNumber:

Saves you a column, making your table more compact
Removes the need for an index (and associated overhead) on BadgeNumber
Eliminates the need to join to your table to get the BadgeNumber when the table has a PK-FK relationship to another table.
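A small sketch of the points above, again using Python's sqlite3 module as a stand-in. The door_access child table and the sample values are hypothetical, and WITHOUT ROWID is SQLite's way of storing the table keyed directly on the primary key; other systems spell this differently (for example a clustered primary key).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
    -- BadgeNumber is the primary key: no surrogate ID column and
    -- no separate unique index on the badge number.
    CREATE TABLE employee (
        badge_number TEXT PRIMARY KEY,
        name         TEXT NOT NULL
    ) WITHOUT ROWID;

    -- Hypothetical child table; the FK column already holds the badge number.
    CREATE TABLE door_access (
        badge_number TEXT NOT NULL REFERENCES employee(badge_number),
        entered_at   TEXT NOT NULL
    );
""")

conn.execute("INSERT INTO employee VALUES ('B-1001', 'Ada')")
conn.execute("INSERT INTO door_access VALUES ('B-1001', '2024-01-01T08:30')")

# The primary key rejects a duplicate badge number.
try:
    conn.execute("INSERT INTO employee VALUES ('B-1001', 'Someone else')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# No join back to employee is needed to report badge numbers.
badges = conn.execute(
    "SELECT badge_number, entered_at FROM door_access"
).fetchall()

print(duplicate_rejected, badges)
```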

There are other ways to reduce I/O and improve read performance. Most commercial DBMSs support some form of data compression. This can fit more rows on a single page without any changes to the structure of the table, at the expense of some CPU overhead to compress/decompress the data as it is written/read. CPU time is usually cheaper than disk I/O, so compression is usually a net benefit.
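To get a feel for the trade-off, here is a toy sketch using Python's zlib module. This is only an analogy: zlib is not what a DBMS uses for row or page compression, and the synthetic rows below are invented and deliberately repetitive, so the ratio you see says nothing about your real data.

```python
import zlib

PAGE_SIZE = 8 * 1024  # bytes; assumed page size

# Synthetic, repetitive "rows" (hypothetical data).
rows = [
    f"{i:06d}|Employee {i % 50}|Springfield|1990-01-01\n".encode()
    for i in range(2000)
]
raw = b"".join(rows)

# Compressing and decompressing is the CPU cost paid on write and read.
packed = zlib.compress(raw)
restored = zlib.decompress(packed)

# Ceiling division: how many pages each form would occupy.
raw_pages = -(-len(raw) // PAGE_SIZE)
packed_pages = -(-len(packed) // PAGE_SIZE)

print(f"raw: {len(raw)} bytes in {raw_pages} pages")
print(f"compressed: {len(packed)} bytes in {packed_pages} pages")
```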

Related Solutions

NoSQL approach: design table “scheme”

Without any secondary index, I can think of a pivot (storing the data as key-value pairs).

Depending on the usage and the queries you expect, you need to decide whether the key "LogText" should go into another table or whether it is OK to put it in Table-2.

You can have a key on both columns, which gives you the option of defining a local secondary index to improve performance for complex queries like

ALL DATA where (TimeStamps is between x and y) and User = "ABC"

This can be done in two steps:

select ALL data where KeyIDs are IN
(get the unique KeyIDs that satisfy these WHERE conditions)
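The two steps can be sketched as follows. This uses Python's sqlite3 module purely as a stand-in for the key-value store, so it is not how a NoSQL API would express it. The table log_kv, its columns (key_id, attr, value), and the sample rows are all hypothetical; only the attribute names TimeStamp, User, and LogText come from the discussion above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per (KeyID, attribute) pair: the pivoted key-value layout.
conn.execute("""
    CREATE TABLE log_kv (
        key_id INTEGER,
        attr   TEXT,
        value  TEXT,
        PRIMARY KEY (key_id, attr)
    )
""")
conn.executemany("INSERT INTO log_kv VALUES (?, ?, ?)", [
    (1, "TimeStamp", "2024-01-01T10:00"), (1, "User", "ABC"), (1, "LogText", "login"),
    (2, "TimeStamp", "2024-01-01T11:00"), (2, "User", "XYZ"), (2, "LogText", "login"),
    (3, "TimeStamp", "2024-01-02T09:00"), (3, "User", "ABC"), (3, "LogText", "logout"),
])

# Step 1: unique KeyIDs whose TimeStamp is in range AND whose User matches.
key_ids = [row[0] for row in conn.execute("""
    SELECT key_id FROM log_kv
    WHERE attr = 'TimeStamp' AND value BETWEEN ? AND ?
    INTERSECT
    SELECT key_id FROM log_kv
    WHERE attr = 'User' AND value = ?
    ORDER BY key_id
""", ("2024-01-01T00:00", "2024-01-01T23:59", "ABC"))]

# Step 2: fetch all data for those KeyIDs.
placeholders = ",".join("?" * len(key_ids))
rows = conn.execute(
    f"SELECT key_id, attr, value FROM log_kv "
    f"WHERE key_id IN ({placeholders}) ORDER BY key_id, attr",
    key_ids,
).fetchall()

print(key_ids)
print(rows)
```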

SQL Server – How to Improve Query Performance

I would suggest trying an index like this:

CREATE NONCLUSTERED INDEX NCI_ProcessState_ProcessStateDateLastModified_SleepUntil 
ON [TABLE](ProcessState,ProcessStateDateLastModified,SleepUntil)

This way the index filters precisely on the state first; filtering on the other columns first would produce larger result sets because of the <= filter.

You can check the index usage easily with this query:

SELECT obj.*, usage.*
FROM sys.dm_db_index_usage_stats AS usage
INNER JOIN sys.indexes AS obj
        ON usage.object_id = obj.object_id
       AND usage.index_id = obj.index_id  -- match the specific index, not every index on the table
WHERE usage.database_id = DB_ID()

Just filter it on your index and take a look at the seek, scan, and update columns (user_seeks, user_scans, user_updates).