Is it appropriate to use a timestamp as a DISTKEY in Redshift

redshift

I'm having a little trouble understanding how to select a DISTKEY for a table I'm working on.

Consider the following table:

create table test_table (
    country     char(2)       encode zstd,
    record_time bigint        encode zstd not null,
    ip          bigint        encode zstd,
    identifier  varchar(41)   encode zstd not null,
    lat         numeric(10,3) encode zstd,
    long        numeric(10,3) encode zstd,
    PRIMARY KEY (event_time, hash)
)
DISTKEY(event_time)
SORTKEY(country, event_time, hash)

My understanding is that DISTKEYs are only of real importance if a table is to be joined with others.

This table will be the only one on its cluster, and thus, wont be joined with other tables. Since that's the case, am I right in assuming that a DISTKEY is unnecessary/redundant, or does a DISTKEY affect more than meets the eye?

Best Answer

It's pretty corner-case, but there are cases where data location matters slightly.

Let's say you ask the database to do this query:

SELECT COUNT(DISTINCT ip) FROM test_table GROUP BY country

If the table is distributed by country, no network activity is needed (I tested this to confirm). For any other distribution style, the hash table will logically need to be re-distributed over the network (I also tested this to confirm).

That said, you probably want to just choose an EVEN distribution style to maximize the scanning speed. For that matter, maybe you want to use Spectrum for this use case.

Related Question