Redshift (PostgreSQL) time-series data: DISTKEY and SORTKEY

Tags: redshift, time-series-database

I have around 6 billion rows of data to move to a redshift DB from MSSQL. The data will have three columns.

(Image showing a sample of the three columns.)

I plan to move the data one tag at a time, as there are 4 years of history at one-minute intervals, so the data will arrive out of time order. This will be easier than moving all 4,000 tags at once for a given time period in chunks.

I have read about DISTKEY and SORTKEY, but I am not sure how to apply them for the best performance. Would anyone have any advice? Should I split the table into multiple tables so that it is not so long, and how would I use DISTKEY and SORTKEY to improve performance?

Note: Redshift (which is PostgreSQL-based) is the only option I have been given at this point.

Thank You

Best Answer

The choice of distribution key and sort key is driven by the queries you intend to issue against the table, because that choice determines which queries can be answered in a timely manner, and those need to be the business queries you must run. You cannot choose either key by looking at the table in isolation, without any consideration of how it will be queried.
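For illustration, here is a hedged sketch of what a table definition might look like if the dominant query pattern is "one tag over a time range" (the column names `tag_id`, `ts`, and `value` are assumptions, since the sample image is not visible):

```sql
-- Sketch only: assumes queries filter on a single tag over a time range.
-- DISTKEY(tag_id) spreads tags evenly across slices; the compound
-- sort key (tag_id, ts) lets the zone maps skip blocks for other tags
-- and for time ranges outside the query window.
CREATE TABLE readings (
    tag_id INTEGER          NOT NULL,
    ts     TIMESTAMP        NOT NULL,
    value  DOUBLE PRECISION
)
DISTKEY (tag_id)
COMPOUND SORTKEY (tag_id, ts);
```

If instead your queries aggregate across all tags for a time window, a sort key leading with `ts` would suit better; this is exactly why the keys cannot be chosen without knowing the queries.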

As an aside, if your example value column here is representative, I would strongly advise you to convert it to an integer, by multiplying its values by 1000 and converting back to a float when you need to use them.

This is because the sorting value for floats is the integer part of the number, so all those example numbers have a sorting value of 1. If most of your values are like that, min-max culling (the zone map) will have essentially no effect, and your queries will behave much as they did on MS SQL: the entire column will be scanned, and only the raw, expensive brute force of the cluster's hardware will provide any improvement in performance.
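The scaling can be done at load time; a sketch (again assuming hypothetical column names, and that three decimal places of precision are sufficient):

```sql
-- Store value scaled by 1000 as an integer so zone maps can
-- discriminate between rows (1.234 -> 1234, 1.567 -> 1567).
CREATE TABLE readings (
    tag_id     INTEGER   NOT NULL,
    ts         TIMESTAMP NOT NULL,
    value_x1k  BIGINT    NOT NULL   -- original value * 1000
);

-- Convert back to a float when reading:
SELECT tag_id, ts, value_x1k / 1000.0 AS value
FROM   readings;
```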

Ah, and finally: if you insert your data out of order and you have six billion rows, you're going to face a multi-day VACUUM after you've finished inserting your data.
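For reference, the commands you would have to run after an out-of-order load look like this (table name assumed):

```sql
-- Re-sort the table and reclaim space, then refresh planner statistics.
-- On six billion out-of-order rows this can run for days.
VACUUM FULL readings;
ANALYZE readings;
```

Loading the data already sorted by the sort key (for example, tag by tag in timestamp order, if the sort key leads with the tag) largely avoids this cost.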

Redshift is a knowledge-intensive database: it can handle Big Data in a timely manner when, and only when, it is operated correctly. Although I may be wrong, I suspect the people having you perform this work do not understand what they are getting into.