Redshift (PostgreSQL) time-series data: DISTKEY and SORTKEY

Tags: redshift, time-series-database

I have around 6 billion rows of data to move to a redshift DB from MSSQL. The data will have three columns.

(Image showing a sample of the three columns.)

I plan to move the data one tag at a time, as there are 4 years of history at one-minute intervals, so the data will arrive out of time order. This will be easier than moving all 4,000 tags at once for a given time period in chunks.

I have read about DISTKEY and SORTKEY, but I am not sure how to apply them for the best performance. Would anyone have any advice? Should I split the table into multiple tables so that it is not so long, and how would I use DISTKEY and SORTKEY to improve performance?

Note: Redshift (which is PostgreSQL-based) is the only option I have been given at this point.

Thank You

Best Answer

The choice of distribution key and sort key is driven by the queries you intend to issue against the table, because that choice determines which queries can be answered in a timely manner, and those need to be the business queries you must run. You cannot choose either key by looking at the table in isolation, without any consideration of how it will be queried.
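For illustration, here is a hedged sketch of what a table definition might look like if the dominant query pattern is "one tag over a time range" (the column names `tag_id`, `ts`, and `value` are assumptions, since the sample image is not visible):

```sql
-- Sketch only: assumes queries filter on a single tag over a time range.
-- DISTKEY(tag_id) spreads tags evenly across slices; the compound
-- sort key (tag_id, ts) lets the zone maps skip blocks for other tags
-- and for time ranges outside the query window.
CREATE TABLE readings (
    tag_id INTEGER          NOT NULL,
    ts     TIMESTAMP        NOT NULL,
    value  DOUBLE PRECISION
)
DISTKEY (tag_id)
COMPOUND SORTKEY (tag_id, ts);
```

If instead your queries aggregate across all tags for a time window, a sort key leading with `ts` would suit better; this is exactly why the keys cannot be chosen without knowing the queries.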

As an aside, if your example value column here is representative, I would strongly advise you to convert it to an integer, by multiplying its values by 1000 and converting back to a float when you need to use them.

This is because the sorting value for floats is the integer part of the number, so all those example numbers have a sorting value of 1. If most of your values are like that, min-max culling (the zone map) will have essentially no effect, and your queries will behave much as they did on MS SQL: the entire column will be scanned, and only the raw, expensive brute force of the cluster's hardware will provide any improvement in performance.
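The scaling can be done at load time; a sketch (again assuming hypothetical column names, and that three decimal places of precision are sufficient):

```sql
-- Store value scaled by 1000 as an integer so zone maps can
-- discriminate between rows (1.234 -> 1234, 1.567 -> 1567).
CREATE TABLE readings (
    tag_id     INTEGER   NOT NULL,
    ts         TIMESTAMP NOT NULL,
    value_x1k  BIGINT    NOT NULL   -- original value * 1000
);

-- Convert back to a float when reading:
SELECT tag_id, ts, value_x1k / 1000.0 AS value
FROM   readings;
```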

Ah, and finally: if you insert your data out of order and you have six billion rows, you're going to face a multi-day VACUUM after you've finished inserting your data.
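For reference, the commands you would have to run after an out-of-order load look like this (table name assumed):

```sql
-- Re-sort the table and reclaim space, then refresh planner statistics.
-- On six billion out-of-order rows this can run for days.
VACUUM FULL readings;
ANALYZE readings;
```

Loading the data already sorted by the sort key (for example, tag by tag in timestamp order, if the sort key leads with the tag) largely avoids this cost.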

Redshift is a knowledge-intensive database: it can handle Big Data in a timely manner when, and only when, it is operated correctly. Although I may be wrong, I suspect the people having you perform this work do not understand what they are getting into.