PostgreSQL Index Type for Not Equal (<>) WHERE Condition

index, performance, postgresql, postgresql-performance, where

I need to query two huge tables (more than a million records). Besides other conditions, there is also a not equal (<>) WHERE condition on a pair of varchar columns. It has to exclude one row in a million.

Values in the column are unique, and the same values appear in both tables, so the join has to exclude the pairs of rows that have identical values in that particular column.

Is a standard b-tree type index suitable, or would some other index type be more suitable for a not equal (<>) where condition?

What about the pg-strom extension? Could pg-strom (utilizing the GPU) speed up a not equal comparison in the WHERE clause?

Best Answer

What about the pg-strom extension? Could pg-strom (utilizing the GPU) speed up a not equal comparison in the WHERE clause?

There is no point in using the GPU unless the operation can make use of highly parallel processing. A simple <> comparison is probably not such a use case. That said, PostgreSQL 9.6 added parallel sequential scans, and parallel index scans followed in PostgreSQL 10.

I need to query two huge tables (more than a million records). Besides other conditions, there is also a not equal (<>) WHERE condition on a pair of varchar columns. It has to exclude one row in a million.

A condition that excludes only one row in a million is unlikely to be planned as an index scan. We would need a test case here, or for you to show the actual query.

Just think of it like this.

Scenario one: you visit a million rows on disk. You need 999,999 of them.
Scenario two: you visit a million entries in an index, which isn't free. You need 999,999 of them. After you determine which one row you do not want, you still have to visit the other 999,999 rows on disk.

Which would you rather do? If you have to visit (almost) all the rows anyway the index doesn't save you any work.
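The two scenarios can be put into numbers with a minimal sketch. It only counts row visits under a toy model; it is not PostgreSQL's planner cost formula, and the function names are hypothetical, chosen for illustration.

```python
# Toy model: every visit to a heap row or an index entry costs one unit.
# This is an illustrative assumption, not how PostgreSQL estimates cost.

def seq_scan_visits(total_rows: int) -> int:
    """Scenario one: read every row on disk once and filter as you go."""
    return total_rows


def index_scan_visits(total_rows: int, excluded_rows: int) -> int:
    """Scenario two: walk every index entry, then fetch each wanted row."""
    index_entries_visited = total_rows
    heap_rows_fetched = total_rows - excluded_rows
    return index_entries_visited + heap_rows_fetched


total = 1_000_000
excluded = 1

print(seq_scan_visits(total))              # 1000000
print(index_scan_visits(total, excluded))  # 1999999
```

Under these assumptions the index path does nearly twice the work, which is why an index on the <> column is not expected to help when almost every row qualifies.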

Related Solutions

Postgresql – slow index scan on postgresql 9.1

Well, the solution was rephrasing NOT EXISTS to NOT IN. After that, the index scan and the whole query ran quite fast, as one would expect.

Postgresql – Optimize PostgreSQL server setting for extremely wide tables

How many of those columns do you use for grouping? If it's relatively few, then I would recommend restructuring the data into a long format, where each grouping (category) column is kept and each grouped-by (metric) column is instead folded into two columns, variable and value, similar to how R's reshape2::melt function works. For instance, a table:

id|cat1|cat2|metric1|metric2|

Would become:

id|cat1|cat2|variable|value
id|cat1|cat2|metric1|value of metric 1 column
...
id|cat1|cat2|metric2|value of metric 2 column

The table would become K times longer, with K being the number of metrics you'd like to melt. This can actually improve query performance if you add indexes on your category columns.
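As a rough illustration of the reshaping described above, here is a minimal pure-Python sketch. The sample rows and the melt helper are hypothetical; only the column names come from the example table.

```python
# Hypothetical wide rows; only the column names match the example table.
wide_rows = [
    {"id": 1, "cat1": "a", "cat2": "x", "metric1": 10, "metric2": 20},
    {"id": 2, "cat1": "b", "cat2": "y", "metric1": 30, "metric2": 40},
]


def melt(rows, id_columns, metric_columns):
    """Emit one (variable, value) row per metric column of each input row."""
    long_rows = []
    for row in rows:
        for metric in metric_columns:
            # Keep the id and category columns unchanged.
            long_row = {column: row[column] for column in id_columns}
            long_row["variable"] = metric
            long_row["value"] = row[metric]
            long_rows.append(long_row)
    return long_rows


long_rows = melt(wide_rows, ["id", "cat1", "cat2"], ["metric1", "metric2"])
for row in long_rows:
    print(row)
```

With K = 2 metric columns, the two input rows become four output rows, matching the K-times growth mentioned above.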

If that doesn't improve performance, then I'd recommend using a different tool than Postgres, such as Apache Spark.
