I'm having a little trouble understanding how to select a DISTKEY for a table I'm working on.
Consider the following table:
create table test_table (
country char(2) encode zstd,
record_time bigint encode zstd not null,
ip bigint encode zstd,
identifier varchar(41) encode zstd not null,
lat numeric(10,3) encode zstd,
long numeric(10,3) encode zstd,
PRIMARY KEY (event_time, hash)
)
DISTKEY(event_time)
SORTKEY(country, event_time, hash)
My understanding is that DISTKEYs are only of real importance if a table is to be joined with others.
This table will be the only one on its cluster, and thus, wont be joined with other tables. Since that's the case, am I right in assuming that a DISTKEY is unnecessary/redundant, or does a DISTKEY affect more than meets the eye?
Best Answer
It's pretty corner-case, but there are cases where data location matters slightly.
Let's say you ask the database to do this query:
SELECT COUNT(DISTINCT ip) FROM test_table GROUP BY country
If the table is distributed by country, no network activity is needed (I tested this to confirm). For any other distribution style, the hash table will logically need to be re-distributed over the network (I also tested this to confirm).
That said, you probably want to just choose an EVEN distribution style to maximize the scanning speed. For that matter, maybe you want to use Spectrum for this use case.