- My computer has 64 cores
- Microsoft SQL Server Data Tools 16.0.62007.23150 installed
- For the moment I have a 500 Mb/s SSD
One initial question: which SQL Server version/edition would be best for 64 cores?
I am new to SQL databases, and I have understood that how you structure the database matters for how fast you can later search and extract the information you need (queries).
All data will be used only for CPU calculations and will not be displayed visually in any dataGridView, report etc. The data will be used for artificial intelligence/random forest.
I believe I have also understood that using data types that take up less memory is good for speed later on, e.g. using a `smallint` instead of an `int` when a `smallint` is sufficient.
I would like to ask whether the structure I have in mind is well designed for extracting information later, or whether I should do this a bit differently. The database will hold stock symbol data, and as you will see, it will be extremely big, which is the point of this question.
This is the whole structure that I have in mind (Example comes after explanation):
- I will use 4 columns: (DateTime | Symbol | FeatureNr | Value)
- DateTime has a format down to the minute, e.g. 201012051545
- Symbol and FeatureNr are `smallint`. For example: MSFT = 1, IBM = 2, AAPL = 3. So instead of storing strings in the columns, I store `smallint` values that represent those symbols/feature numbers, so that search queries go faster later.
- The database will have, for example, 50 symbols where each symbol has 5000 features.
- The database will have 15 years of data.
Now I have a few big questions:
If we fill this database with data for just 1 symbol, it will contain this many rows:
1440 minutes (1 day) * 365 days * 15 years * 5000 features = 39,420,000,000 rows
Question 1:
Is 39,420,000,000 rows in a database a lot, or is this no problem?
Question 2:
The above was just for 1 symbol, but I have 50 symbols, which would mean:
39,420,000,000 * 50 = 1,971,000,000,000 rows.
I don't know what to say about this. Is this too many rows, or is it okay? Should I have 1 database per symbol, for example, instead of all 50 symbols in one database?
Question 3:
Setting aside how many rows the database has: do you think the database is well structured for fast search queries? What I will ALWAYS search for, every single time, is the following (it will return 5000 rows/features). Notice that I search for ONE symbol only and a specific datetime.
I will always do this exact search, and never any other type of search, in case you have any idea how I should best structure the database with those 50 stock symbols.
I will need all 5000 rows/features, where each row is a feature that needs to be fed to the random forest algorithm. This means that each symbol, at each minute (e.g. 201012051546), has 5000 features/values.
As in Question 2: should I have one table per symbol? Would this result in faster searches, for example?
(symbol = 2, smalldatetime = 201012051546), where I want to return the FeatureNr and Value, which would be the lines below (I will ALWAYS ONLY do this exact search):
201012051546 | 2 | 1 | 76.123456789
201012051546 | 2 | 2 | 76.123456789
201012051546 | 2 | 3 | 76.123456789
Question 4:
Wouldn't it be optimal to have 1 table for each combination of symbol and datetime?
In other words: 1 table for symbol = 2 and smalldatetime 1546, which holds 5000 rows of features, and then one such table per symbol and datetime?
This would result in 7,884,000 tables per symbol:
1440 minutes (1 day) * 365 days * 15 years = 7,884,000 tables per symbol
Or is this bad in some other way? Note that I will later need to loop over all of those tables (7,884,000 per symbol) and retrieve all features (5000 per table), and it is very important that this goes as fast as possible. I know it might be difficult to estimate, but approximately how long could a process like this take with my structure on a 64-core computer?
My idea for the database/table structure:
smalldatetime | symbol (smallint) | featureNr (smallint) | value (float(53))
201012051545 | 1 | 1 | 65.123456789
201012051546 | 1 | 1 | 66.123456789
201012051547 | 1 | 1 | 67.123456789
201012051545 | 1 | 2 | 65.123456789
201012051546 | 1 | 2 | 66.123456789
201012051547 | 1 | 2 | 67.123456789
201012051545 | 1 | 3 | 65.123456789
201012051546 | 1 | 3 | 66.123456789
201012051547 | 1 | 3 | 67.123456789
201012051545 | 2 | 1 | 75.123456789
201012051546 | 2 | 1 | 76.123456789
201012051547 | 2 | 1 | 77.123456789
201012051545 | 2 | 2 | 75.123456789
201012051546 | 2 | 2 | 76.123456789
201012051547 | 2 | 2 | 77.123456789
201012051545 | 2 | 3 | 75.123456789
201012051546 | 2 | 3 | 76.123456789
201012051547 | 2 | 3 | 77.123456789
201012051545 | 3 | 1 | 85.123456789
201012051546 | 3 | 1 | 86.123456789
201012051547 | 3 | 1 | 87.123456789
201012051545 | 3 | 2 | 85.123456789
201012051546 | 3 | 2 | 86.123456789
201012051547 | 3 | 2 | 87.123456789
201012051545 | 3 | 3 | 85.123456789
201012051546 | 3 | 3 | 86.123456789
201012051547 | 3 | 3 | 87.123456789
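As a T-SQL sketch, the single-table layout I describe above would look something like this (table and column names are just illustrative):

```sql
-- Sketch of the proposed single-table layout; names are illustrative
CREATE TABLE dbo.SymbolFeatureValue
(
    ValueDtm  smalldatetime NOT NULL,  -- minute resolution, e.g. 2010-12-05 15:45
    Symbol    smallint      NOT NULL,  -- 1 = MSFT, 2 = IBM, 3 = AAPL, ...
    FeatureNr smallint      NOT NULL,  -- 1..5000
    Value     float(53)     NOT NULL
);
```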
Best Answer
A few things:
As answered in the comments, unless you are hitting this table with a LOT of reads, the number of cores is probably less important.
You don't define the keys explicitly, although you do describe the intended lookup pattern.
I'd say you're close (with a few caveats), so let's start by defining the tables:
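A sketch of table definitions consistent with this answer (the names `Stock`, `StockFeature`, `StockSymbol`, `FeatureNbr`, `ValueDtm` and the key order come from the answer; everything else, including the value table's name and column types, is an assumption):

```sql
-- A sketch, not the answer's exact DDL; see comments for assumptions
CREATE TABLE dbo.Stock
(
    StockSymbol char(5)      NOT NULL,   -- e.g. 'MSFT'; width is an assumption
    Name        varchar(100) NOT NULL,   -- e.g. 'Microsoft Corporation'
    CONSTRAINT PK_Stock PRIMARY KEY (StockSymbol)
);

CREATE TABLE dbo.StockFeature
(
    StockSymbol char(5)  NOT NULL,
    FeatureNbr  smallint NOT NULL,       -- 1..5000
    CONSTRAINT PK_StockFeature PRIMARY KEY (StockSymbol, FeatureNbr),
    CONSTRAINT FK_StockFeature_Stock FOREIGN KEY (StockSymbol)
        REFERENCES dbo.Stock (StockSymbol)
);

CREATE TABLE dbo.StockFeatureValue
(
    StockSymbol char(5)       NOT NULL,
    FeatureNbr  smallint      NOT NULL,
    ValueDtm    smalldatetime NOT NULL,
    Value       float(53)     NOT NULL,
    -- Key order matters: rows are stored sorted by these columns
    CONSTRAINT PK_StockFeatureValue
        PRIMARY KEY CLUSTERED (StockSymbol, FeatureNbr, ValueDtm),
    CONSTRAINT FK_StockFeatureValue_StockFeature FOREIGN KEY (StockSymbol, FeatureNbr)
        REFERENCES dbo.StockFeature (StockSymbol, FeatureNbr)
);
```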
Now if we want the values for all features of a symbol at a given point in time, this query will return data very quickly:
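Assuming a value table named `dbo.StockFeatureValue` clustered on `(StockSymbol, FeatureNbr, ValueDtm)` (the name is an assumption; the key is from this answer), the lookup could be:

```sql
-- All 5000 feature values for one symbol at one minute
SELECT FeatureNbr, Value
FROM dbo.StockFeatureValue
WHERE StockSymbol = 'MSFT'
  AND ValueDtm = '2010-12-05 15:46';
```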
The reason this will work is that the primary key is defined as `(StockSymbol, FeatureNbr, ValueDtm)`. Order is important here, as SQL Server will store the rows (for the most part) sorted according to those values. Which means that to return the necessary data, the database has to read at most 1 page per feature. In practical terms this is nothing.

We could use a `SMALLINT` for `StockSymbol`, but that would require an additional query to `Stock` to retrieve the necessary value, which isn't the end of the world but not really worth it for 3 bytes/row.

One observation: in your sample data, `Value` doesn't vary across features. I don't know if this is an artifact of how you generated the data, but if `Value` does not change according to `FeatureNbr`, then you have a normalization error and the primary key should be `(StockSymbol, ValueDtm)`, with the feature information stored elsewhere.

My answer to a similar question may also be of benefit: MySQL DB design for very long growing row size
Addressing a comment:
Those are foreign key constraints. They do two things: they guarantee that every row inserted into a child table refers to a row that actually exists in the parent table, and they prevent deleting (or changing the key of) a parent row that is still referenced.
Another thing I want to clarify: if our actual data looks like this:
We would only need seven rows to represent the same amount of data:
And this is the power of the data model. We don't have to store every minute of data to return the correct result. And we can return that result quickly because the clustered index allows us to find the necessary rows without reading large chunks of the table.
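Reading the value in effect at a given minute from such change-point storage is a seek on the same clustered index. A sketch, assuming a value table `dbo.StockFeatureValue` clustered on `(StockSymbol, FeatureNbr, ValueDtm)` (names are illustrative):

```sql
-- Latest stored value at or before a given minute for one symbol/feature;
-- works even when not every minute is stored
SELECT TOP (1) Value
FROM dbo.StockFeatureValue
WHERE StockSymbol = 'MSFT'
  AND FeatureNbr = 1
  AND ValueDtm <= '2010-12-05 15:46'
ORDER BY ValueDtm DESC;
```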
Addressing additional comments:
This is a result of normalization. We normalize the data to ensure the data contained within is consistent with our understanding. So yes, we could store the data in a single table, but we would have no way of guaranteeing the data is consistent.
The `Stock` table guarantees that only valid stock symbols are inserted into `StockFeature`, which in turn guarantees that only valid stock symbol/feature combinations are inserted into our final table.

If I Google MSFT, it shows 'MSFT - Microsoft Corporation'. 'Microsoft Corporation' is the `Name` of the `Stock`. This is data that belongs to the `Stock` alone - we would not store it anywhere else.

There's not a lot of good information on database normalization (a lot of it varies between okay and really bad), but if you don't mind digging into some heavy reading you can start with E. F. Codd's work and then maybe move on to one of C. J. Date's books, although I find them lacking in some regards (his treatment of time-dependent data is... bad).
See the above example for why you would most likely have far fewer than ~39 billion rows. You would only have that many rows if every feature changed value every minute. And even if that were the case (it most likely isn't), the above solution will still work, as the right clustered index will allow the database to find the necessary pages quickly.
The search algorithm in this case is a B-Tree, and the overall size of the table generally will not impact individual lookups. The total row size impacts maintenance, aggregation, and queries that cannot utilize an efficient indexing strategy, so we don't want to make things bigger than they have to be.