Amazon Redshift and PostgreSQL – How Can Amazon Redshift Be Built on PostgreSQL if Postgres is Row-Based?

data-warehouse postgresql redshift

According to Amazon, Redshift is based on PostgreSQL and is a column-oriented database management system. That looks self-contradictory to me: isn't Postgres a row-oriented database?

Best Answer

A Postgres server is a lot of things, not just row-oriented access methods:

(Image not reproduced; it comes from the Postgres documentation.)

The Postgres source code is available for anyone to use under a very permissive license, so implementing a DBMS that is "based on PostgreSQL" does not mean rewriting it from scratch. If, say, you want to introduce a different layout for storing your data, you can likely reuse the main server process, client APIs, query parser and rewriter, and most of the utility and security functions with little or no change. You would probably need to modify the plan generator and executor, add new access methods, and replace parts of the page storage manager. Given the resources at Amazon's disposal, this doesn't look like an impossible undertaking.

Amazon basically says as much:

"the specialized data storage schema and query execution engine that Amazon Redshift uses are completely different from the PostgreSQL implementation. [...] Amazon Redshift stores data in columns, using specialized data compression encodings for optimum memory usage and disk I/O. Some PostgreSQL features that are suited to smaller-scale OLTP processing, such as secondary indexes and efficient single-row data manipulation operations, have been omitted [...]"
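To make the row versus column distinction concrete, here is a toy Python sketch. It is not how Postgres or Redshift actually store data (real engines deal with pages, compression, and much more), and the table, column names, and helper functions are all made up for illustration. It only shows that the same logical table can sit behind two different physical layouts, which is the part a "based on PostgreSQL" system would swap out.

```python
# Toy illustration only: the table and all names here are hypothetical.
rows = [
    (1, "alice", 30.0),
    (2, "bob", 12.5),
    (3, "carol", 7.5),
]

# Row-oriented layout: the fields of each record are stored together.
row_store = list(rows)

# Column-oriented layout: the values of each column are stored together.
column_store = {
    "id": [r[0] for r in rows],
    "name": [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}


def total_amount_row_store(store):
    # Every full record is visited even though only one field is needed.
    return sum(record[2] for record in store)


def total_amount_column_store(store):
    # Only the "amount" column is read; the other columns are never touched.
    return sum(store["amount"])


row_total = total_amount_row_store(row_store)
column_total = total_amount_column_store(column_store)
print(row_total, column_total)
```

Both functions answer the same question, which is why the parser and client APIs can stay the same while the storage and execution layers change underneath.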

Related Solutions

PostgreSQL – How to Get pg_archivecleanup on Amazon Linux 2014.03

With the help of this AWS forum thread, I realized that this line in /etc/yum.conf

exclude=postgresql*

excludes not only updates but also installs. Once I had (temporarily) commented that line out, I was able to use

sudo yum install postgresql-contrib

to install postgresql-contrib directly from the Amazon repo, with no need to change the repository or the operating system.

Amazon-rds – Amazon Redshift – The difference between Query Slots, Concurrency and Queues

I think your understanding of query queues is a little off.

"A queue is like a thread in Java. A query arrives and is designated to the 'less loaded' queue, and it waits for its turn to be resolved."

The decision of where to put a query is independent of how busy a queue is; queries are allocated based on the rules you've set up: http://docs.aws.amazon.com/redshift/latest/dg/cm-c-wlm-queue-assignment-rules.html
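As a rough mental model, rule-based assignment can be sketched like this in Python. This is a simplification under my own assumptions, not Redshift's implementation: the Queue class, assign_queue function, and the group names are hypothetical, and the superuser queue is left out. The thing to notice is that queue load is not an input at all.

```python
from dataclasses import dataclass, field


@dataclass
class Queue:
    # Hypothetical stand-in for one user-defined queue and its matching rules.
    name: str
    user_groups: set = field(default_factory=set)
    query_groups: set = field(default_factory=set)


def assign_queue(queues, default_queue, user_groups, query_group=None):
    """Return the name of the first queue whose rules match, else the default."""
    for queue in queues:
        # Match on any user group the submitting user belongs to.
        if queue.user_groups & set(user_groups):
            return queue.name
        # Otherwise match on the query group label, if one was set.
        if query_group is not None and query_group in queue.query_groups:
            return queue.name
    return default_queue


queues = [
    Queue("etl", user_groups={"loaders"}),
    Queue("reports", query_groups={"dashboard"}),
]

a = assign_queue(queues, "default", user_groups=["loaders"])
b = assign_queue(queues, "default", user_groups=["analysts"], query_group="dashboard")
c = assign_queue(queues, "default", user_groups=["analysts"])
print(a, b, c)
```

A query with no matching user group or query group always lands in the default queue, however idle the other queues are.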

"We can have as many queues as we like."

Not quite; see http://docs.aws.amazon.com/redshift/latest/dg/cm-c-defining-query-queues.html. The relevant line is:

"The maximum total concurrency level for all user-defined queues, not including the reserved Superuser queue, is 50."

So the concurrency is definitely limited, but that's a reasonable restriction because each concurrency slot reserves some of the cluster's resources.

"A queue has some memory allocated (we guess divided equally?)"

By default, memory is split equally among queues, but you can assign it in increments of 1% of cluster memory. See the link above about defining query queues.

"In a queue we can assign user groups or queries groups. But in short-term, that's a lot of classification work in our queries we can't do right now."

Creating user groups and setting query groups is actually pretty simple; see http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_GROUP.html and http://docs.aws.amazon.com/redshift/latest/dg/r_query_group.html.

If you didn't set up user groups or query groups, then my guess is that you didn't see any improvement after adding the extra queues because all the queries still ran in a single query queue. (In fact, the extra queues would simply take resources from the queue running your queries.) Since all your queries are running from the same user and for the same purpose, it makes sense that they should all resolve to the same query queue. A better solution might be to have only a single queue, but increase the concurrency level. (Keep in mind that a query queue's resources are divided equally among all concurrency slots, even if they're not being used, so a concurrency level of 50 means no query gets more than 2% of the total resources.)
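The arithmetic behind that 2% figure is just the queue's share divided by its concurrency level. A small Python sketch, assuming (as described above) that a queue's resources are split evenly across its slots; the function name and the second example's numbers are made up for illustration:

```python
def per_slot_share(queue_memory_pct, concurrency):
    """Percent of total cluster resources available to each slot in a queue.

    Assumes an even split across slots, whether or not they are in use.
    """
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    return queue_memory_pct / concurrency


# One queue holding 100% of resources with 50 slots.
single_queue = per_slot_share(100, 50)

# Hypothetical alternative: a queue holding 50% of resources with 5 slots.
smaller_queue = per_slot_share(50, 5)

print(single_queue, smaller_queue)
```

So raising the concurrency level lets more queries run at once, but each one gets a thinner slice.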

All that being said, Redshift is a higher-latency system than many other database solutions. If you're only looking to run lots of small queries, then it might not be the best fit for your problem.
