SQL Server – Structure the database to make search queries fast

database-design, sql-server

  • My computer has 64 cores
  • Microsoft SQL Server Data Tools 16.0.62007.23150 installed
  • I currently have a 500 MB/s SSD

One initial question: which SQL Server version/edition would be best for 64 cores?

I am new to SQL databases and have understood that how you structure the database is important for how fast it will later be to search for and extract the information you need (queries).

All data will be used only for CPU calculations and will not be displayed visually in any DataGridView, report, etc. The data will be used for artificial intelligence / random forest.

I believe I have also understood that using data types that take up less memory is good for speed later on, for example using a smallint instead of an int when a smallint will do.

I would like to ask whether the structure I have in mind is well designed for extracting information later, or whether I should do this a bit differently. The database will hold stock symbol data, and as I show below it will be extremely big, which is the reason for this question.

This is the whole structure that I have in mind (Example comes after explanation):

  1. I will use 4 columns. (DateTime|Symbol|FeatureNr|Value)
  2. DateTime has a format with minute resolution: 201012051545
  3. Symbol and FeatureNr are smallint. For example: MSFT = 1, IBM = 2, AAPL = 3. As you can see, instead of storing strings in the columns, I store smallint values that represent those symbols/feature numbers. This is so search queries run faster later.
  4. The database will for example have 50 symbols where each symbol has 5000 features.
  5. The database will have 15 years of data.

Now I have a few big questions:

If we fill this database with data for just 1 symbol, it will contain this many rows:

1440 minutes(1 day) * 365 days * 15 years * 5000 features = 39,420,000,000

Question 1:

39,420,000,000 rows in a database seems like a lot, or is this no problem?

Question 2:

The above was just for 1 symbol. I will have 50 symbols, which would mean:

39,420,000,000 * 50 = 1,971,000,000,000 rows.

I don't know what to say about this. Is this too many rows, or is it okay? Should I, for example, have 1 database per symbol rather than all 50 symbols in one database?

Question 3:

Setting aside how many rows are in the database: do you think the database is well structured for fast search queries? What I will ALWAYS search for, every single time, is the following (it will return 5000 lines/features). Notice that I search for one symbol ONLY and a specific datetime.

I will always do this exact search, and never any other type of search, so let me know if you have any idea how I should best structure the database with those 50 stock symbols.

I will need all 5000 rows/features, where each row is a feature that will be fed to the random forest algorithm. This means that each symbol at datetime 201012051546 has 5000 features/values.

As in Question 2: should I have one table per symbol? Would this result in faster searches, for example?

(symbol = 2, smalldatetime = 201012051546), where I want to return the featureNr and value, which would be the lines below (I will ALWAYS and ONLY do this exact search):

201012051546 | 2 | 1 | 76.123456789
201012051546 | 2 | 2 | 76.123456789
201012051546 | 2 | 3 | 76.123456789
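
In SQL terms, the only query I will ever run would look roughly like this (just a sketch; the table and column names such as SymbolFeatureValue and UpdateDtm are placeholders, since nothing is created yet):

-- Sketch only: placeholder table/column names for the structure described below
SELECT FeatureNr, Value
FROM SymbolFeatureValue
WHERE Symbol = 2                          -- one symbol ONLY
  AND UpdateDtm = '2010-12-05 15:46:00';  -- one specific datetime, returns 5000 rows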

Question 4:

Wouldn't it be most optimal to have 1 table for each combination of symbol and datetime?

In other words: 1 table for symbol = 2 and smalldatetime 1546, which holds 5000 rows of features, and then the same for every symbol and datetime?

This will result in 7,884,000 tables per symbol.

Or is this a bad idea for some other reason? Notice that I will later need to loop through all of those tables (7,884,000 per symbol) and retrieve all features (5000 per table), and it is very important that this goes as fast as possible. I know it might be difficult to estimate, but approximately how long could a process like this take with my structure on a 64-core computer?

1440 minutes(1 day) * 365 days * 15 years = 7,884,000 tables per symbol


My idea for the database/table structure:

smalldatetime | symbol (smallint) | featureNr (smallint) | value (float(53))

201012051545 | 1 | 1 | 65.123456789
201012051546 | 1 | 1 | 66.123456789
201012051547 | 1 | 1 | 67.123456789
201012051545 | 1 | 2 | 65.123456789
201012051546 | 1 | 2 | 66.123456789
201012051547 | 1 | 2 | 67.123456789
201012051545 | 1 | 3 | 65.123456789
201012051546 | 1 | 3 | 66.123456789
201012051547 | 1 | 3 | 67.123456789

201012051545 | 2 | 1 | 75.123456789
201012051546 | 2 | 1 | 76.123456789
201012051547 | 2 | 1 | 77.123456789
201012051545 | 2 | 2 | 75.123456789
201012051546 | 2 | 2 | 76.123456789
201012051547 | 2 | 2 | 77.123456789
201012051545 | 2 | 3 | 75.123456789
201012051546 | 2 | 3 | 76.123456789
201012051547 | 2 | 3 | 77.123456789

201012051545 | 3 | 1 | 85.123456789
201012051546 | 3 | 1 | 86.123456789
201012051547 | 3 | 1 | 87.123456789
201012051545 | 3 | 2 | 85.123456789
201012051546 | 3 | 2 | 86.123456789
201012051547 | 3 | 2 | 87.123456789
201012051545 | 3 | 3 | 85.123456789
201012051546 | 3 | 3 | 86.123456789
201012051547 | 3 | 3 | 87.123456789
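
If I were to write that out as a table definition, I imagine something like the following (again just a sketch with placeholder names; I have not decided on keys or indexes yet, which is part of my question):

-- Sketch of the single-table idea above; names are placeholders
CREATE TABLE SymbolFeatureValue
(
  UpdateDtm  SMALLDATETIME  NOT NULL  -- minute resolution, e.g. 2010-12-05 15:45
 ,Symbol     SMALLINT       NOT NULL  -- 1 = MSFT, 2 = IBM, 3 = AAPL, ...
 ,FeatureNr  SMALLINT       NOT NULL  -- 1..5000
 ,Value      FLOAT(53)      NOT NULL
);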

Best Answer

A few things:

As answered in the comments, unless you are hitting this table with a LOT of reads, the number of cores is probably less important:

Your first question is answered in the documentation; you will need Enterprise Edition to use more than 24 cores. With a unique clustered index on the natural key (datetime, symbol, feature) I would expect millisecond response time with a single table using a single core. Manageability of a 50TB table is another matter. - Dan Guzman

Your proposed workload is not going to be cpu constrained. As Dan Guzman said the structure appears solid and you can expect great performance out of your specified design and queries on even modest (server grade, for sure, but modest) hardware. Make sure your disks are as fast as possible. - Jonathan Fite

You don't define the keys explicitly, although you do mention:

Notice that I search for one symbol ONLY and a specific datetime. I will always do this exact search, and never any other type of search, so let me know if you have any idea how I should best structure the database with those 50 stock symbols.

So:

  1. If you need to know a price as of a point in time for a particular symbol and feature, you only need to store when the price actually changes. No need to store minute by minute. In practical terms, this means you will most likely have a lot less than ~7 million rows per symbol. Even if things did change by the minute, that would only occur while the exchange was open.
  2. If the search will always be for a particular symbol, it forms part of the primary key and within the physical structure of SQL Server, it should be the leading column of the clustered index.
  3. You are correct that everything else being equal, a small row size will result in faster performance. However, that shouldn't be the overriding motivation for the selection of a key. Saving 2-3 bytes per row isn't worth it if it requires additional overhead to query or maintain uniqueness.

I'd say you're close (with a few caveats), so let's start by defining the tables:

CREATE TABLE Stock
(
  StockSymbol  CHAR(5)      NOT NULL --NYSE is 3, NASDAQ is 4-5
 ,Name         VARCHAR(50)  NOT NULL
 ,CONSTRAINT PK_Stock PRIMARY KEY (StockSymbol)
 ,CONSTRAINT AK_Stock UNIQUE (Name)
)
GO

CREATE TABLE StockFeature
(
  StockSymbol  CHAR(5)      NOT NULL
 ,FeatureNbr   SMALLINT     NOT NULL  --I'm assuming this value is understood ahead of time, otherwise consider using a shortname or code
 ,Name         VARCHAR(50)  NOT NULL
 ,CONSTRAINT FK_StockFeature_Describes_Stock FOREIGN KEY (StockSymbol) REFERENCES Stock (StockSymbol)
 ,CONSTRAINT PK_StockFeature PRIMARY KEY (StockSymbol, FeatureNbr)
 ,CONSTRAINT AK_StockFeature UNIQUE (StockSymbol, Name)
)
GO

CREATE TABLE StockFeatureValue
(
  StockSymbol  CHAR(5)       NOT NULL
 ,FeatureNbr   SMALLINT      NOT NULL
 ,ValueDtm     DATETIME2(0)  NOT NULL
 ,Value        FLOAT(53)     NOT NULL
 ,CONSTRAINT FK_StockFeatureValue_Measurement_Of_StockFeature FOREIGN KEY (StockSymbol, FeatureNbr) REFERENCES StockFeature (StockSymbol, FeatureNbr)
 ,CONSTRAINT PK_StockFeatureValue PRIMARY KEY (StockSymbol, FeatureNbr, ValueDtm)
)
GO

Now if we want the values for all features of a symbol at a given point in time, this query will return data very quickly:

SELECT
  SFVal.StockSymbol
 ,SFVal.FeatureNbr
 ,SFVal.Value
FROM
  StockFeatureValue SFVal
WHERE
  SFVal.StockSymbol = 'MSFT'
    AND SFVal.ValueDtm =
      (
        SELECT
          MAX(ValueDtm)
        FROM
          StockFeatureValue 
        WHERE
          StockSymbol = SFVal.StockSymbol
            AND FeatureNbr = SFVal.FeatureNbr
            AND ValueDtm <= '2010-12-05 15:45:00'  --Or whatever
      )

The reason this will work is that the primary key is defined as (StockSymbol, FeatureNbr, ValueDtm). Order is important here, as SQL Server will store the rows (for the most part) sorted according to those values, which means that to return the necessary data the database has to read at most 1 page per feature. In practical terms this is nothing.
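
If you want to verify this for yourself once the table is loaded, one simple (optional) check is to turn on I/O statistics and run the query; the logical reads reported for StockFeatureValue should stay roughly on the order of a page or two per feature, no matter how large the table grows:

SET STATISTICS IO ON;

SELECT SFVal.StockSymbol, SFVal.FeatureNbr, SFVal.Value
FROM StockFeatureValue SFVal
WHERE SFVal.StockSymbol = 'MSFT'
  AND SFVal.ValueDtm =
    (SELECT MAX(ValueDtm)
     FROM StockFeatureValue
     WHERE StockSymbol = SFVal.StockSymbol
       AND FeatureNbr = SFVal.FeatureNbr
       AND ValueDtm <= '2010-12-05 15:45:00');

SET STATISTICS IO OFF;
-- Check the "logical reads" figure reported for StockFeatureValue
-- in the Messages tab; it should remain small regardless of table size.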

We could use a SMALLINT for StockSymbol but that would require an additional query to Stock to retrieve the necessary value, which isn't the end of the world but not really worth it for 3 bytes/row.
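
Just to illustrate the trade-off, the surrogate-key variant would look roughly like this (a hypothetical StockId column that is not part of the design above), with an extra join just to filter by symbol:

-- Hypothetical alternative, NOT part of the design above:
-- Stock and StockFeatureValue keyed by a SMALLINT StockId instead of CHAR(5)
SELECT
  S.StockSymbol
 ,SFVal.FeatureNbr
 ,SFVal.Value
FROM
  Stock S
  INNER JOIN StockFeatureValue SFVal
    ON SFVal.StockId = S.StockId   -- StockId would have to be added to both tables
WHERE
  S.StockSymbol = 'MSFT'
    AND SFVal.ValueDtm = '2010-12-05 15:45:00';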

One observation - in your sample data Value doesn't vary across features. I don't know if this is an artifact of how you generated the data, but if Value does not change according to FeatureNbr then you have a normalization error and the primary key should be (StockSymbol, ValueDtm) with the feature information stored elsewhere.

My answer to a similar question may also be of benefit: MySQL DB design for very long growing row size

Addressing a comment:

I understand, so it is better to use your example. I really need to learn the code you wrote for the 3 tables; I am looking at it but am not exactly sure how everything links together (all the rows where CONSTRAINT is used are hard for me to understand). If I understand correctly, you have linked them somehow so search queries will work in an efficient way when you execute that code for '2010-12-05 15:45:00'.

Those are foreign key constraints. They do two things:

  1. Keep invalid values out of your database
  2. Inform the database engine of the relation between the tables so better query plans can be generated
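
To make point 1 concrete: with the constraints above in place, a row that refers to a symbol/feature combination that was never defined in StockFeature is rejected outright, for example:

-- Fails with a foreign key violation, because ('ZZZZ', 999)
-- does not exist in StockFeature
INSERT INTO StockFeatureValue (StockSymbol, FeatureNbr, ValueDtm, Value)
VALUES ('ZZZZ', 999, '2015-01-01 00:01:00', 1.0);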

Another thing I want to clarify, if our actual data is this:

StockSymbol | FeatureNbr    | ValueDtm              | Value
'MSFT'      | 1             | '2015-01-01 00:01:00' | 1.000
'MSFT'      | 1             | '2015-01-01 00:02:00' | 1.000
'MSFT'      | 1             | '2015-01-01 00:03:00' | 1.000
'MSFT'      | 1             | '2015-01-01 00:04:00' | 1.100
'MSFT'      | 1             | '2015-01-01 00:05:00' | 1.100
'MSFT'      | 1             | '2015-01-01 00:06:00' | 1.100
'MSFT'      | 1             | '2015-01-01 00:07:00' | 1.100
'MSFT'      | 1             | '2015-01-01 00:08:00' | 1.100
'MSFT'      | 1             | '2015-01-01 00:09:00' | 1.201
'MSFT'      | 1             | '2015-01-01 00:10:00' | 1.100
'MSFT'      | 2             | '2015-01-01 00:01:00' | 0.253
'MSFT'      | 2             | '2015-01-01 00:02:00' | 0.253
'MSFT'      | 2             | '2015-01-01 00:03:00' | 0.253
'MSFT'      | 2             | '2015-01-01 00:04:00' | 0.253
'MSFT'      | 2             | '2015-01-01 00:05:00' | 0.253
'MSFT'      | 2             | '2015-01-01 00:06:00' | 0.561
'MSFT'      | 2             | '2015-01-01 00:07:00' | 0.561
'MSFT'      | 2             | '2015-01-01 00:08:00' | 0.561
'MSFT'      | 2             | '2015-01-01 00:09:00' | 0.561
'MSFT'      | 2             | '2015-01-01 00:10:00' | 0.561
'MSFT'      | 3             | '2015-01-01 00:01:00' | 6.942
'MSFT'      | 3             | '2015-01-01 00:02:00' | 6.942
'MSFT'      | 3             | '2015-01-01 00:03:00' | 6.942
'MSFT'      | 3             | '2015-01-01 00:04:00' | 6.942
'MSFT'      | 3             | '2015-01-01 00:05:00' | 6.942
'MSFT'      | 3             | '2015-01-01 00:06:00' | 6.942
'MSFT'      | 3             | '2015-01-01 00:07:00' | 6.942
'MSFT'      | 3             | '2015-01-01 00:08:00' | 6.942
'MSFT'      | 3             | '2015-01-01 00:09:00' | 6.942
'MSFT'      | 3             | '2015-01-01 00:10:00' | 6.942

We would only need seven rows to represent the same amount of data:

StockSymbol | FeatureNbr    | ValueDtm              | Value
'MSFT'      | 1             | '2015-01-01 00:01:00' | 1.000
'MSFT'      | 1             | '2015-01-01 00:04:00' | 1.100
'MSFT'      | 1             | '2015-01-01 00:09:00' | 1.201
'MSFT'      | 1             | '2015-01-01 00:10:00' | 1.100
'MSFT'      | 2             | '2015-01-01 00:01:00' | 0.253
'MSFT'      | 2             | '2015-01-01 00:06:00' | 0.561
'MSFT'      | 3             | '2015-01-01 00:01:00' | 6.942

And this is the power of the data model. We don't have to store every minute of data to return the correct result. And we can return that result quickly because the clustered index allows us to find the necessary rows without reading large chunks of the table.
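
For example, running the same as-of query from earlier against the seven-row version for '2015-01-01 00:07:00' still produces the correct value for all three features:

SELECT
  SFVal.StockSymbol
 ,SFVal.FeatureNbr
 ,SFVal.Value
FROM
  StockFeatureValue SFVal
WHERE
  SFVal.StockSymbol = 'MSFT'
    AND SFVal.ValueDtm =
      (
        SELECT MAX(ValueDtm)
        FROM StockFeatureValue
        WHERE StockSymbol = SFVal.StockSymbol
          AND FeatureNbr = SFVal.FeatureNbr
          AND ValueDtm <= '2015-01-01 00:07:00'
      )

-- Returns:
-- 'MSFT' | 1 | 1.100
-- 'MSFT' | 2 | 0.561
-- 'MSFT' | 3 | 6.942
-- which matches what the minute-by-minute data showed at 00:07:00.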

Addressing additional comments:

I know this is a beginner question, but I don't really understand why 3 tables are used. Why couldn't we just use one table and put everything in it? When using 3 tables, StockSymbol, FeatureNbr and Name exist in 2-3 tables. Doesn't this mean that the database will take up more memory, since those become duplicates? I also wonder why Name is a column. Isn't StockSymbol the name of the symbol?

This is a result of normalization. We normalize the data to ensure the data contained within is consistent with our understanding. So yes, we could store the data in a single table, but we would have no way of guaranteeing the data is consistent.

The Stock table guarantees that only valid stock symbols are inserted into StockFeature, which guarantees that only valid stock symbols/features are inserted into our final table.

If I Google MSFT, it shows 'MSFT - Microsoft Corporation'. 'Microsoft Corporation' is the Name of the Stock. This is data that belongs to the Stock alone; we would not store it anywhere else.

There's not a lot of good information on database normalization (what is out there ranges from okay to really bad), but if you don't mind some heavy reading you can start with E.F. Codd's work and then maybe move on to one of C.J. Date's books, although I find them lacking in some regards (his treatment of time-dependent data is... bad).

You mentioned: "this means you will most likely have a lot less than ~7 million rows per symbol". Notice that it would be 39,420,000,000 rows per symbol in one table, as I calculated in my original post. Is it a problem for this search algorithm to have so many rows in one table?

See the above example for why you would most likely have far less than ~39 billion rows. You would only have that many rows if each feature changed values every minute. And even if this was the case (it most likely isn't), the above solution will still work as the right clustered index will allow the database to find the necessary pages quickly.

The search algorithm in this case is a B-tree, and the overall size of the table generally will not impact individual lookups. The total row count does impact maintenance, aggregation, and queries that cannot use an efficient indexing strategy, so we don't want to make things bigger than they have to be.
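
If you are curious, one (optional) way to see how shallow that B-tree actually is, even for a huge table, is to ask SQL Server for the clustered index depth:

-- Reports the number of B-tree levels (index_depth) and page counts
-- for the clustered index on StockFeatureValue; even at billions of rows
-- the depth typically stays in the single digits.
SELECT index_depth, page_count
FROM sys.dm_db_index_physical_stats(
       DB_ID(), OBJECT_ID('StockFeatureValue'), NULL, NULL, 'DETAILED');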