Postgresql – Postgesql non-deterministic order of result with LIMIT and ORDER BY

postgresql

I understand that Postgresql may return results in non-deterministic order when the query parameters are non-unique. This is especially a problem when we use LIMIT since this may return entirely different results. So the suggestion for safety is to use ORDER BY with LIMIT.

However, what if we use ORDER BY on an non-unique column name? Will this produce a deterministic result or no? Let me demonstrate with an example:

We have the following data:

id | name | index
1    foo     5
1    bar     2
1    test    5

If I query SELECT * FROM <table> WHERE id = 1 ORDER BY index LIMIT 1;. Will this produce the same result every time?

Best Answer

If there are multiple rows with the lowest value for index, then no, there is no guarantee you will always get the same row. If you only specify ordering for a column that contains duplicates, then the database does not have to provide anything more than any row that meets the criteria.

Same would be true if you were to say ORDER BY index DESC - which row with 5 would you expect? Why?

In order to force determinism / predictability, add a tie-breaker on a column that is unique (or at least unique within the same values of index). With your sample data, you could say:

ORDER BY index, name LIMIT 1;

But you haven’t shared enough about your table structure and constraints to get a more specific suggestion than that.

Related Solutions

PostgreSQL – Slow ORDER BY with LIMIT Solutions

My guess is that this would fix your query:

SELECT * 
FROM   location 
WHERE     to_tsvector('simple',unaccent2(city))
       @@ to_tsquery('simple',unaccent2('wroclaw')) 
ORDER  BY to_tsvector('simple',unaccent2(city))
       @@ to_tsquery('simple',unaccent2('wroclaw')) DESC
         ,displaycount 
LIMIT  20;

I repeat the WHERE condition as first element of the ORDER BY clause - which is logically redundant, but should keep the query planner from assuming it would be better off processing rows according to the index location_displaycount_index - which turns out to be much more expensive.

The underlying problem is that the query planner obviously grossly misjudges the selectivity and / or cost of your WHERE condition. I can only speculate why that is.

Do you have autovacuum running - which should also take care of running ANALYZE on your tables? Thereby, are your table-statistics up-to-date? Any effect if you run:

ANALYZE location;

And try again?

It can also be that the selectivity of the @@ operator is being misjudged. I would imagine that it is very hard to estimate for logical reasons.

If my query should not solve the problem, and generally to verify the underlying theory do one of these two things:

temporarily delete the index location_displaycount_index
temporarily disable basic index scans by running:
```
SET enable_indexscan = OFF;
```

The latter is less intrusive and only affects the current session. It leaves the methods bitmap heap scan and bitmap index scan open, which are used by the faster plan.
Then re-run the query.

BTW: If the theory is sound, your query (as you have it now) will be much faster with a less selective search term in the FTS condition - contrary to what you might expect. Try it.

Postgresql – How does postgres use the schema search path when making indexes

Given the following setup:

regress=> CREATE SCHEMA A;
CREATE SCHEMA
regress=> CREATE SCHEMA B;
CREATE SCHEMA
regress=> SET search_path = B, public;
SET
regress=> CREATE TABLE bar(email text);
CREATE TABLE
CREATE UNIQUE INDEX index_bar_on_email ON bar USING btree (email);
CREATE INDEX

I cannot reproduce the problem you report in PostgreSQL 9.2:

regress=> SET search_path = A, B;
SET
regress=> CREATE TABLE bar(email text);
CREATE TABLE
regress=> CREATE UNIQUE INDEX index_bar_on_email ON bar USING btree (email);
CREATE INDEX

However, rather than using the search_path, it's safer to use explicit schema-qualification. For example, I'd re-write the above as:

regress=> RESET search_path;
RESET
regress=> SHOW search_path ;
  search_path   
----------------
 "$user",public
(1 row)

CREATE TABLE B.bar(email text);
CREATE UNIQUE INDEX b.index_bar_on_email ON b.bar USING btree (email);
CREATE TABLE A.bar(email text);
CREATE UNIQUE INDEX index_bar_on_email ON A.bar USING btree (email);

The indexes are automatically created in the schema of their associated table; see:

regress=> \di B.
                  List of relations
 Schema |        Name        | Type  | Owner | Table 
--------+--------------------+-------+-------+-------
 b      | index_bar_on_email | index | craig | bar
(1 row)

regress=> \di A.
                  List of relations
 Schema |        Name        | Type  | Owner | Table 
--------+--------------------+-------+-------+-------
 a      | index_bar_on_email | index | craig | bar
(1 row)

Update based on question change:

Yes, what you've shown does look like a Rails adapter issue. It's checking to see whether the index exists in any schema. It should be checking to see whether the first table of the given name in the search_path has the named index.

I would write the query differently. I'd leave off the join on pg_class entirely, instead using a cast to regclass to handle search_path resolution for me. I'd use the resulting oid to search for the index. Compare original, then updated, below. Note that the updated query does require search_path to be set first.

regress=> SELECT DISTINCT i.relname, d.indisunique, d.indkey, t.oid, am.amname
 FROM pg_class t, pg_class i, pg_index d, pg_am am
 WHERE i.relkind = 'i'
 AND d.indexrelid = i.oid
 AND d.indisprimary = 'f'
 AND t.oid = d.indrelid
 AND t.relname = 'bar'
 AND i.relnamespace IN (SELECT oid FROM pg_namespace WHERE nspname IN ('b','a') )
 AND i.relam = am.oid
 ORDER BY i.relname
;
      relname       | indisunique | indkey |  oid  | amname 
--------------------+-------------+--------+-------+--------
 index_bar_on_email | t           | 1      | 28585 | btree
 index_bar_on_email | t           | 1      | 28592 | btree
(2 rows)

regress=> SELECT DISTINCT i.relname, d.indisunique, d.indkey, 'bar'::regclass::oid, am.amname
 FROM pg_class i, pg_index d, pg_am am
 WHERE i.relkind = 'i'
 AND d.indexrelid = i.oid
 AND d.indisprimary = 'f'
 AND 'bar'::regclass = d.indrelid
 AND i.relam = am.oid
 ORDER BY i.relname                                                              
;
      relname       | indisunique | indkey | regclass | amname 
--------------------+-------------+--------+----------+--------
 index_bar_on_email | t           | 1      | 28585    | btree
(1 row)

Best Answer

Related Solutions

PostgreSQL – Slow ORDER BY with LIMIT Solutions

Postgresql – How does postgres use the schema search path when making indexes

Related Question