Is it appropriate to use a timestamp as a DISTKEY in Redshift?

redshift

I'm having a little trouble understanding how to select a DISTKEY for a table I'm working on.

Consider the following table:

create table test_table (
    country     char(2)       encode zstd,
    record_time bigint        encode zstd not null,
    ip          bigint        encode zstd,
    identifier  varchar(41)   encode zstd not null,
    lat         numeric(10,3) encode zstd,
    long        numeric(10,3) encode zstd,
    PRIMARY KEY (record_time, identifier)
)
DISTKEY(record_time)
SORTKEY(country, record_time, identifier);

My understanding is that DISTKEYs are only of real importance if a table is to be joined with others.

This table will be the only one on its cluster, and thus won't be joined with other tables. Since that's the case, am I right in assuming that a DISTKEY is unnecessary/redundant, or does a DISTKEY affect more than meets the eye?

Best Answer

It's pretty corner-case, but there are cases where data location matters slightly.

Let's say you ask the database to do this query:

SELECT COUNT(DISTINCT ip) FROM test_table GROUP BY country

If the table is distributed by country, no network activity is needed (I tested this to confirm). For any other distribution style, the hash table will logically need to be re-distributed over the network (I also tested this to confirm).
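If you want to repeat that check yourself, one possible approach is sketched below. It assumes the column names from the question's table definition; the table name test_table_by_country is made up for illustration, and the exact step labels you see may differ between Redshift versions.

```sql
-- Hypothetical copy of the table, distributed by country instead of by time.
CREATE TABLE test_table_by_country
DISTKEY(country)
SORTKEY(country, record_time)
AS SELECT * FROM test_table;

-- Run the aggregate against the copy...
SELECT country, COUNT(DISTINCT ip)
FROM test_table_by_country
GROUP BY country;

-- ...then list the steps of the query that just ran in this session.
-- Steps labelled as distribution or broadcast suggest rows moved between nodes.
SELECT query, seg, step, label, rows, bytes
FROM svl_query_summary
WHERE query = pg_last_query_id()
ORDER BY seg, step;
```

Running the same pair of statements against the original table gives you a plan to compare with. Keep in mind that a column with few distinct values, such as country, can leave some slices holding far more rows than others.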

That said, you probably want to just choose DISTSTYLE EVEN, which spreads rows evenly across slices and so maximizes scanning speed. For that matter, maybe you want to use Redshift Spectrum for this use case.
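As a rough sketch of that suggestion, the table from the question could be declared with even distribution and then checked for row skew. The table name test_table_even is an assumption for illustration, and the column list is copied from the question.

```sql
-- Same columns as the question's table, but rows are spread round-robin.
CREATE TABLE test_table_even (
    country     char(2)       ENCODE zstd,
    record_time bigint        ENCODE zstd NOT NULL,
    ip          bigint        ENCODE zstd,
    identifier  varchar(41)   ENCODE zstd NOT NULL,
    lat         numeric(10,3) ENCODE zstd,
    long        numeric(10,3) ENCODE zstd
)
DISTSTYLE EVEN
SORTKEY(country, record_time);

-- After loading data, inspect the distribution style and row skew.
-- svv_table_info only lists tables that contain rows.
SELECT "table", diststyle, tbl_rows, skew_rows
FROM svv_table_info
WHERE "table" = 'test_table_even';
```

A skew_rows value close to 1 means the slices hold similar numbers of rows, which is what you want for a table that is only ever scanned.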

Related Solutions

Redshift Queues

Amazon provides workload management (WLM) specifically for this task.

WLM lets you allocate memory and other resources per queue, and set limits such as concurrency and timeouts. If you have access to the AWS Redshift console, you can assign a parameter group to a cluster, then browse to Parameter Groups > WLM and set the following WLM parameters for that cluster:

Concurrency - Max number of queries which can run concurrently.

User Groups - Create user groups (for example report_gr, etl_gr, default_gr) and assign users to them accordingly.

Timeout - Timeout value for that user group's queries

Memory - Percentage of memory allocated for that user group's queries.
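The user groups themselves are created in SQL rather than in the console. A minimal sketch, assuming the users report_user and etl_user already exist (both names are hypothetical):

```sql
-- Create the groups that the WLM queues will be matched against.
CREATE GROUP report_gr;
CREATE GROUP etl_gr;

-- Assign existing users to the groups (user names are placeholders).
ALTER GROUP report_gr ADD USER report_user;
ALTER GROUP etl_gr ADD USER etl_user;

-- Optional, and separate from the WLM timeout: a per-session limit
-- in milliseconds (here 10 minutes).
SET statement_timeout TO 600000;
```

Queries from a user are routed to the queue whose user group list contains one of that user's groups, so the group names here need to match the ones entered in the WLM configuration.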

Redshift – Optimize Expensive Query

The main thing is to avoid the nested loop join that is caused by the "between" in the join condition.

In your example specifically, I would start by rewriting this as

SELECT 
  visitor.id,
  visitor.ip,
  LAST_VALUE(zip IGNORE NULLS) OVER (
    ORDER BY COALESCE(geo_ip.start_ip, visitor.ip)
    ROWS UNBOUNDED PRECEDING
  ) as zip
FROM geo_ip
FULL OUTER JOIN visitor
ON visitor.ip=geo_ip.start_ip

Note that Redshift will only do a full outer join if it considers the condition merge-joinable, which means you should set the distribution key and sort key of the two tables to visitor.ip and geo_ip.start_ip, respectively. If this is not an option, or too much of a bother, you can do this instead as a UNION ALL, since you don't actually need the geo and visitor records to be on the same row:

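SELECT 
  visitor_id,
  ip AS visitor_ip,
  LAST_VALUE(zip IGNORE NULLS) OVER (
    -- at equal ip, sort the geo_ip row (non-null zip) first so that a
    -- visitor whose ip equals start_ip picks up that range's zip
    ORDER BY ip, CASE WHEN zip IS NULL THEN 1 ELSE 0 END
    ROWS UNBOUNDED PRECEDING
  ) as zip
FROM (
  SELECT null as visitor_id, start_ip as ip, zip
  FROM geo_ip
  UNION ALL
  SELECT id as visitor_id, ip, null as zip
  FROM visitor
) AS combined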
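The UNION ALL form also returns one row per geo_ip range, which you probably don't want in the final result. A sketch of one way to drop them is to wrap the query and filter afterwards, since a WHERE at the same level would remove the geo_ip rows before the window function sees them. This assumes geo_ip(start_ip, zip) and visitor(id, ip) as in the queries here, and that visitor.id is never null; the aliases combined and filled are made up.

```sql
SELECT visitor_id, ip AS visitor_ip, zip
FROM (
  SELECT
    visitor_id,
    ip,
    LAST_VALUE(zip IGNORE NULLS) OVER (
      -- geo_ip rows (visitor_id is null) sort first at equal ip
      ORDER BY ip, CASE WHEN visitor_id IS NULL THEN 0 ELSE 1 END
      ROWS UNBOUNDED PRECEDING
    ) AS zip
  FROM (
    SELECT NULL AS visitor_id, start_ip AS ip, zip
    FROM geo_ip
    UNION ALL
    SELECT id AS visitor_id, ip, NULL AS zip
    FROM visitor
  ) AS combined
) AS filled
-- keep only the rows that came from the visitor table
WHERE visitor_id IS NOT NULL;
```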

These would get rid of the nested loop join. You can still improve upon this for larger clusters by allowing the operation to be distributable.

For example, if you can partition your IP address table by the first octet (i.e., no row in the table has a range whose start and end fall in different first octets), you could add a partition to your window function, which should let the sorting be processed on separate nodes in your cluster:

SELECT 
  visitor.id,
  visitor.ip,
  LAST_VALUE(zip IGNORE NULLS) OVER (
    -- 16777216 = 2^24, so the quotient is the first octet of an IPv4 address stored as an integer
    PARTITION BY (COALESCE(geo_ip.start_ip, visitor.ip) / 16777216)::int
    ORDER BY COALESCE(geo_ip.start_ip, visitor.ip)
    ROWS UNBOUNDED PRECEDING
  ) as zip
FROM geo_ip
FULL OUTER JOIN visitor
ON visitor.ip=geo_ip.start_ip
