I think your understanding of query queues is a little off.
"A queue is like a thread in Java. A query arrives and is assigned to the least-loaded queue, where it waits for its turn to be resolved."
The decision of where to put a query is independent of how busy a queue is; queries are allocated according to the rules you've set up: http://docs.aws.amazon.com/redshift/latest/dg/cm-c-wlm-queue-assignment-rules.html
"We can have as many queues as we like."
Not quite; check out http://docs.aws.amazon.com/redshift/latest/dg/cm-c-defining-query-queues.html. The relevant line is:
"The maximum total concurrency level for all user-defined queues, not including the reserved Superuser queue, is 50."
So the concurrency is definitely limited, but that's a reasonable restriction because each concurrency slot reserves some of the cluster's resources.
"A queue has some memory allocated (divided equally, we guess?)"
By default memory is split equally among queues, but you can assign memory at the granularity of 1% of cluster memory. See the above link about defining query queues.
"In a queue we can assign user groups or query groups, but in the short term that's a lot of classification work in our queries that we can't do right now."
Creating user groups and setting query groups is actually pretty simple, see http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_GROUP.html and http://docs.aws.amazon.com/redshift/latest/dg/r_query_group.html.
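For illustration, a minimal setup might look like the sketch below. The group name, user names, and query-group label are all hypothetical, and a WLM queue matching them must already exist in your cluster's configuration:

```sql
-- Hypothetical user group; WLM rules can route this group's queries
-- to a dedicated queue.
CREATE GROUP report_users WITH USER report_user1, report_user2;

-- Alternatively, label individual queries at runtime with a query group.
-- Until RESET, queries run in the queue whose WLM rules match the
-- 'reports' label (hypothetical name).
SET query_group TO 'reports';
SELECT COUNT(*) FROM some_report_table;
RESET query_group;
```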
If you didn't set up user groups or query groups, then my guess is that you didn't see any improvement after adding the extra queues because all the queries still ran in a single query queue. (In fact, the extra queues would simply take resources from the queue running your queries.) Since all your queries are running from the same user and for the same purpose, it makes sense that they should all resolve to the same query queue. A better solution might be to have only a single queue, but increase the concurrency level. (Keep in mind that a query queue's resources are divided equally among all concurrency slots, even if they're not being used, so a concurrency level of 50 means no query gets more than 2% of the total resources.)
All that being said, Redshift is a higher-latency system than many other database solutions. If you're only looking to run lots of small queries, then it might not be the best fit for your problem.
The main thing is to avoid the nested loop join caused by the BETWEEN in the join condition.
In your example specifically, I would start by rewriting this as:
SELECT
    visitor.id,
    visitor.ip,
    LAST_VALUE(zip IGNORE NULLS) OVER (
        ORDER BY COALESCE(geo_ip.start_ip, visitor.ip)
        ROWS UNBOUNDED PRECEDING
    ) AS zip
FROM geo_ip
FULL OUTER JOIN visitor
    ON visitor.ip = geo_ip.start_ip
Note that Redshift will only do a full outer join if it considers the condition merge-joinable, which means you should set the distribution and sort keys of both tables to visitor.ip and geo_ip.start_ip, respectively. If this is not an option, or too much of a bother, you can instead write this as a UNION ALL, since you don't actually need the geo and visitor records to be on the same row:
SELECT
    visitor_id,
    ip,
    LAST_VALUE(zip IGNORE NULLS) OVER (
        -- Geo rows sort before visitor rows at the same ip, so a visitor
        -- whose ip equals a range's start_ip still picks up that zip.
        ORDER BY ip, CASE WHEN zip IS NULL THEN 1 ELSE 0 END
        ROWS UNBOUNDED PRECEDING
    ) AS zip
FROM (
    SELECT NULL AS visitor_id, start_ip AS ip, zip
    FROM geo_ip
    UNION ALL
    SELECT id AS visitor_id, ip, NULL AS zip
    FROM visitor
) AS unioned
These would get rid of the nested loop join. You can still improve upon this for larger clusters by allowing the operation to be distributable.
For example, if you can partition your IP address table by the first octet (i.e., no row in the table has a range whose start and end have different first octets), you could add a partition to your window function, which should let the sorting be processed on separate nodes in your cluster:
SELECT
    visitor.id,
    visitor.ip,
    LAST_VALUE(zip IGNORE NULLS) OVER (
        -- 16777216 = 2^24; dividing a 32-bit integer IP by it yields the
        -- first octet, so partitions align with first-octet boundaries.
        PARTITION BY (COALESCE(geo_ip.start_ip, visitor.ip) / 16777216)::int
        ORDER BY COALESCE(geo_ip.start_ip, visitor.ip)
        ROWS UNBOUNDED PRECEDING
    ) AS zip
FROM geo_ip
FULL OUTER JOIN visitor
    ON visitor.ip = geo_ip.start_ip
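The interval-lookup trick these queries rely on can be sketched outside SQL: merge the range starts and visitor IPs into one sorted stream, then carry the most recent non-null zip forward onto each visitor row, just as LAST_VALUE ... IGNORE NULLS does. The sample data here is hypothetical, and ranges are assumed contiguous (end_ip is ignored, as in the queries above):

```python
geo_ip = [  # (start_ip, zip) -- hypothetical, non-overlapping ranges
    (16777216, "10001"),
    (33554432, "94103"),
]
visitors = [  # (id, ip) -- hypothetical
    (1, 16777300),
    (2, 33554500),
    (3, 100),  # below every range start, so no zip
]

# UNION ALL equivalent: one row per range start, one per visitor.
# Sort key (ip, kind) puts geo rows (kind 0) before visitor rows (kind 1)
# at the same ip, mirroring the ORDER BY in the SQL.
rows = [(ip, 0, None, zip_) for ip, zip_ in geo_ip] + \
       [(ip, 1, vid, None) for vid, ip in visitors]
rows.sort(key=lambda r: (r[0], r[1]))

result = {}
last_zip = None  # running LAST_VALUE(zip IGNORE NULLS)
for ip, _, vid, zip_ in rows:
    if zip_ is not None:
        last_zip = zip_       # geo row: update the carried zip
    elif vid is not None:
        result[vid] = last_zip  # visitor row: read the carried zip

print(result)  # → {3: None, 1: '10001', 2: '94103'}
```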
Best Answer
Amazon provides WLM (Workload Management) specifically for this task.
WLM lets you allocate memory and other resources: concurrency, timeout values, and so on. If you have access to the AWS Redshift Console, you can assign a parameter group to a cluster, then browse to Parameter Groups > WLM and set the following WLM parameters for that cluster -
Concurrency - the maximum number of queries that can run concurrently in the queue.
User Groups - create user groups (e.g. report_gr, etl_gr, default_gr) and assign users to them accordingly.
Timeout - timeout value (in milliseconds) for that user group's queries.
Memory - percentage of cluster memory allocated to that user group's queries.
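The same settings can be supplied as the wlm_json_configuration parameter of the parameter group. A sketch, with group names and numbers purely illustrative (the final entry with no user or query group is the default queue):

```json
[
  {
    "user_group": ["report_gr"],
    "query_concurrency": 5,
    "memory_percent_to_use": 30,
    "max_execution_time": 60000
  },
  {
    "user_group": ["etl_gr"],
    "query_concurrency": 3,
    "memory_percent_to_use": 40
  },
  {
    "query_concurrency": 5
  }
]
```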