AWS Redshift – Managing Deep Copy Disk Space

awsredshift

I am trying to perform a deep copy on Redshift. See http://docs.aws.amazon.com/redshift/latest/dg/performing-a-deep-copy.html

I created the new/temporary table using the SAME DDL, including compression settings, as the original table.
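For reference, this is roughly what I did (a sketch; the actual DDL is larger). Per the linked doc, the CREATE TABLE ... LIKE form of a deep copy inherits the encoding, distkey, and sortkey of the parent table:

-- Create the target table with the same DDL as the original.
-- LIKE inherits column encodings, distribution key, and sort key.
create table events_tmp (like events);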

I have 3 nodes, each node is using 21% disk space.
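(For what it's worth, I'm getting the per-node numbers from STV_PARTITIONS, along these lines; used and capacity are in 1 MB blocks:)

-- Per-node disk usage, as a percentage of capacity.
select owner as node,
       sum(used) as used_mb,
       sum(capacity) as capacity_mb,
       (sum(used)::decimal / sum(capacity) * 100)::decimal(5,1) as pct_used
from stv_partitions
group by owner
order by owner;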

When I run the copy:

insert into events_tmp (select * from events); 

I run out of disk space.
Does the deep copy not compress the data?

Thanks!

Crash and Burn

Best Answer

See vceron's post on https://discourse.snowplowanalytics.com/t/redshift-maintenance-best-practices/1124/2

When doing a deep copy of a highly unsorted table, Redshift needs to sort the table before inserting the rows into the new one.
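You can see how unsorted the source table is with SVV_TABLE_INFO before running the deep copy (something like the following; the table name is the one from the question):

-- "unsorted" is the percentage of rows in the unsorted region;
-- "size" is the table size in 1 MB blocks.
select "table", unsorted, size, tbl_rows
from svv_table_info
where "table" = 'events';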

This sort operation takes place in an intermediate temporary table, which is initially held in memory but, as the data set is too big, eventually spills over to disk.

However, those intermediate temporary tables are not compressed, so Redshift has to allocate uncompressed temporary disk space for the operation, which results in a disk-full error if there is not enough free space for the temporary data.
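You can confirm that the sort spilled to disk by checking SVL_QUERY_SUMMARY for the INSERT's query ID, along these lines (the query ID below is a placeholder):

-- Steps with is_diskbased = 't' ran out of working memory
-- and spilled to disk.
select query, seg, step, label, rows, workmem, is_diskbased
from svl_query_summary
where query = 123456  -- placeholder: query ID of the INSERT
order by seg, step;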