PostgreSQL – How to Force Order of WHERE Clauses

I have 6 columns in a Postgres table:

A1 character varying(5)[]
A2 character varying(5)[]
A3 int REFERENCES ... -- FK 
B1 character varying(5)[]
B2 character varying(5)[]
B3 int REFERENCES ... -- FK

I need a SELECT where the first row matched is the winner (limit 1), with matching against the A group and the B group.

I know Postgres doesn't care about the order of WHERE clauses, so I must prepare an ORDER BY clause or find a different approach.

I want to prepare a lookup with matching priorities. The importance of my WHERE clauses is as follows:

highest priority: (A1 and B1) OR
.                 (A1 and B2 OR A2 and B1) OR
.                 (A1 and B3 OR A3 and B1) OR
.                 (A2 and B3 OR A3 and B2) OR
lowest priority:  (A3 and B3)

The query is matching with 6 values, like:
a1 to A1, a2 to A2, a3 to A3, b1 to B1, b2 to B2, b3 to B3

So A1 and B1 means a1 matched with A1 and b1 matched with B1,

in SQL this part is written as:

("A1" @> ARRAY['19956']::varchar(5)[]
 AND "B1" @> ARRAY['27407']::varchar(5)[])

with a1='19956', b1='27407'.

Is it possible to write this as a single query, despite the declarative nature of SQL? I am considering 5 joins on the same table, but maybe there is an easier way.

Best Answer

This information is crucial:

the first row matched is the winner (limit 1)

You do not actually need any sort order for a single result. ORDER BY is just one way to solve the task. I suggest a completely different approach, probably (much) faster:

-- A1, B1, ... are shorthand for the match predicates from the question
SELECT * FROM tbl WHERE A1 AND B1
UNION ALL
SELECT * FROM tbl WHERE (A1 AND B2) OR (A2 AND B1)
UNION ALL
SELECT * FROM tbl WHERE (A1 AND B3) OR (A3 AND B1)
UNION ALL
SELECT * FROM tbl WHERE (A2 AND B3) OR (A3 AND B2)
UNION ALL
SELECT * FROM tbl WHERE A3 AND B3
LIMIT 1;

This is assuming each SELECT can only return a single row. Else you get an arbitrary pick or you need to add ORDER BY to individual SELECTs enclosed in parentheses for a deterministic pick. (See linked answers below.)
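A minimal sketch of the parenthesized form for a deterministic pick, written out for the first two priority levels only. The tie-breaking column id and the values for a2 and b2 are assumptions for illustration; a1 and b1 are the values from the question.

```sql
-- Assumptions: tbl has a unique column "id" (hypothetical) used as tie-breaker;
-- a2 = '19957' and b2 = '27408' are made-up values.
(
   SELECT * FROM tbl
   WHERE  "A1" @> ARRAY['19956']::varchar(5)[]
   AND    "B1" @> ARRAY['27407']::varchar(5)[]
   ORDER  BY id          -- deterministic pick within this priority level
   LIMIT  1
)
UNION ALL
(
   SELECT * FROM tbl
   WHERE ("A1" @> ARRAY['19956']::varchar(5)[] AND "B2" @> ARRAY['27408']::varchar(5)[])
   OR    ("A2" @> ARRAY['19957']::varchar(5)[] AND "B1" @> ARRAY['27407']::varchar(5)[])
   ORDER  BY id
   LIMIT  1
)
-- remaining priority levels follow the same pattern
LIMIT 1;                 -- applies to the result of the whole UNION ALL
```

The parentheses are required: without them, ORDER BY and LIMIT would attach to the combined result instead of the individual SELECT.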

The beauty of it: Postgres stops evaluating as soon as the first row is found. Test with EXPLAIN ANALYZE to see "never executed" for trailing SELECTs. So you can use indexes, which might mean orders of magnitude in performance.
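To check this on your own data, something like the following should do. The GIN indexes and their names are my assumption (the question does not show any indexes), and the FK values 1 and 2 are made up. Whether the planner actually uses the indexes depends on table size and data distribution.

```sql
-- Assumption: GIN indexes on the array columns, so the @> operator can use them.
CREATE INDEX IF NOT EXISTS tbl_a1_gin_idx ON tbl USING gin ("A1");
CREATE INDEX IF NOT EXISTS tbl_b1_gin_idx ON tbl USING gin ("B1");

EXPLAIN ANALYZE
SELECT * FROM tbl
WHERE  "A1" @> ARRAY['19956']::varchar(5)[]
AND    "B1" @> ARRAY['27407']::varchar(5)[]
UNION ALL
SELECT * FROM tbl
WHERE  "A3" = 1 AND "B3" = 2     -- lowest priority level, made-up FK values
LIMIT  1;
```

If the first SELECT finds a row, the plan node for the second one should be reported as "never executed".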

And you can conveniently return a default row if nothing matches. Just append one more SELECT before LIMIT 1.
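A sketch of such a default row, reduced to two output columns and two priority levels. The sentinel values -1 and the FK values 1 and 2 are made up. The default SELECT has to return the same number of columns, with compatible types, as the SELECTs above it.

```sql
SELECT "A3", "B3" FROM tbl
WHERE  "A1" @> ARRAY['19956']::varchar(5)[]
AND    "B1" @> ARRAY['27407']::varchar(5)[]
UNION ALL
SELECT "A3", "B3" FROM tbl
WHERE  "A3" = 1 AND "B3" = 2
UNION ALL
SELECT -1, -1          -- default row, only reached if nothing above matched
LIMIT  1;
```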

BTW, this is a single query.

PostgreSQL – How to order by Levenshtein distance

You didn't tell us what the "failed attempt" means. But something like this should work:

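with ldist as (
   select name,
          -- truncate both arguments: levenshtein() rejects strings longer than 255 characters
          levenshtein(substring(name,1,200),
                      lag(substring(name,1,200)) over (order by name)) as distance
   from books
   where name <> ''
)
select *
from ldist
order by distance;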

You don't want partition by name, because that puts every distinct name into its own partition, so there is no "previous" row to compare with.

As you want the "previous" row based on the ordering of the name column, you need an order by name in the window definition.
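Note that levenshtein() is not built in. It is provided by the additional module fuzzystrmatch, which has to be installed once per database. A quick check that it is available:

```sql
-- Install the module that provides levenshtein() (requires sufficient privileges)
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- Classic textbook pair: the edit distance between these two words is 3
SELECT levenshtein('kitten', 'sitting') AS distance;
```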

PostgreSQL – Do Fixed-Width Rows Improve Read Performance?

Do tables with only fixed width values perform read queries better than those with varying widths?

Basically no. There are very minor costs when accessing columns, but you won't be able to measure any difference. Details:

Does the order of columns in a Postgres table impact performance?

In particular:

There is no difference in performance between character varying(255) and text at all. You seem to be under the impression that varchar(255) (unlike text) might be a "fixed-width" type, but that is not so. Both are variable-length types, varchar(255) just adds a maximum length check:
Would index lookup be noticeably faster with char vs varchar when all values are 36 chars
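A small sketch to see this for yourself. The temporary table width_demo is hypothetical and only serves as illustration:

```sql
-- Hypothetical demo table: same value stored as varchar(255) and as text
CREATE TEMP TABLE width_demo (v varchar(255), t text);
INSERT INTO width_demo VALUES ('abc', 'abc');

-- Both columns report the same storage size for the same value
SELECT pg_column_size(v) AS varchar_bytes
     , pg_column_size(t) AS text_bytes
FROM   width_demo;

-- The only difference is the length check on input:
INSERT INTO width_demo (t) VALUES (repeat('x', 256));  -- works
INSERT INTO width_demo (v) VALUES (repeat('x', 256));  -- ERROR: value too long for type character varying(255)
```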

The use of varchar(255) in a table definition typically indicates a lack of understanding of the Postgres type system. The architect behind it is most probably not a native speaker - or the layout has been carried over from another RDBMS like SQL Server where this used to matter.

Your most expensive query SELECT COUNT(*) FROM articles does not even consider row data at all, only the total size matters indirectly. Counting all rows is costly in Postgres due to its MVCC model. Maybe an estimate is good enough, which can be had very cheaply?
Fast way to discover the row count of a table
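One common way to get such an estimate is to read the planner statistics from the system catalog. The number is only as fresh as the last VACUUM or ANALYZE (including autovacuum) on the table, so treat it as approximate:

```sql
-- Estimated row count from the system catalog, no table scan needed
SELECT reltuples::bigint AS estimate
FROM   pg_class
WHERE  oid = 'articles'::regclass;
```

If the table has never been vacuumed or analyzed, the estimate is not meaningful yet. Run ANALYZE articles; first in that case.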

(Pretend disk space isn't an issue.)

Disk space is always an issue, even if you have plenty. The size on disk (number of data pages that have to be read / processed / written) is one of the most important factors for performance.

Where can I learn more about the internals of the Postgres DB engine?

The info page for the tag postgres has the most important links to more information, including books, the Postgres Wiki and the excellent manual. The latter is my personal favorite.

Your third query has issues

SELECT * FROM articles WHERE user_id = $1 ORDER BY published_date DESC LIMIT 1;

ORDER BY published_date DESC, but published_date can be NULL (no NOT NULL constraint). That's a loaded foot-gun if there can be NULL values, unless you prefer NULL values over the latest actual published_date.

Either add a NOT NULL constraint. Always do that for columns that can't be NULL.
Or make that ORDER BY published_date DESC NULLS LAST and adapt the index accordingly:

"articles_user_id_published_date_idx" btree (user_id, published_date DESC NULLS LAST)

Details in this recent, related answer:

Extremely slow query on indexed column in Postgres
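Put together, the second option could look like this. The index name is taken from the table definition shown above; the sort order of the query has to match the index definition for the index to support it directly:

```sql
-- Index with the same sort order as the query
CREATE INDEX articles_user_id_published_date_idx
ON articles (user_id, published_date DESC NULLS LAST);

-- Latest actual published_date per user, NULL values sorted last
SELECT *
FROM   articles
WHERE  user_id = $1
ORDER  BY published_date DESC NULLS LAST
LIMIT  1;
```

With a NOT NULL constraint in place instead, the plain ORDER BY published_date DESC and a plain index on (user_id, published_date) would be enough.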

Convert `published_date` to an actual `date`

While 'published_date' is always rounded, it's effectively just a date which occupies 4 bytes instead of 8 for the timestamp. You would best move that up in the table definition to come before the two timestamp columns, so you don't lose the 4 bytes to padding:

...
body           | text
published_date | date   --     <---- here
created_at     | timestamp without time zone
updated_at     | timestamp without time zone

Smaller on-disk storage does make a difference for performance.
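A sketch of the type conversion, assuming the column currently is a timestamp as in the table definition above. This rewrites the table and takes an exclusive lock for the duration, so plan for that on a big table:

```sql
-- Convert the column in place; any time-of-day component is discarded
ALTER TABLE articles
   ALTER COLUMN published_date TYPE date USING published_date::date;
```

ALTER TABLE does not change the position of the column. To actually move it in front of the two timestamp columns you would have to recreate the table (or the column) in the desired order.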

Configuring PostgreSQL for read performance

More importantly, your index on (user_id, published_date) would now just occupy 32 bytes per index entry instead of 40, because 2x4 bytes do not incur extra padding. And that would make a noticeable difference for performance.

Aside: this index is not relevant to the demonstrated queries. Delete it unless it is used elsewhere:

~~"index_articles_on_published_date" btree (published_date)~~
