PostgreSQL – How to speed up counting and filtering on a table with many columns and filter conditions

Tags: count, index, postgresql, postgresql-performance, where

I have TABLE_A in my database that has a lot of columns.

Let's say there are 100 columns: COLUMN_1, COLUMN_2, COLUMN_3, .. COLUMN_100

There are so many columns not because of a denormalized design, but because each row represents an entity with a lot of properties (in this example, 99 properties; the first column is just an id).

My application should do the following tasks:

  • Receive a set of filter conditions from the user
  • Execute a query in the database that filters the data on the given conditions and then counts the resulting rows

There are no limits on the number of filter conditions:

  • there may be no filter conditions
  • there may be filter conditions on every column of the table
  • there may be filter conditions on some columns of the table (for example, on 50 columns out of all 100)

So below is an example query that my application executes:

SELECT 
    COUNT(*)
FROM 
    TABLE_A 
WHERE 
    COLUMN_1 = 'test'
    AND COLUMN_2 != 'q'
    AND COLUMN_45 > 5
    AND COLUMN_45 < 511
    AND COLUMN_92 LIKE '%ddd%'
    AND COLUMN_98 > 1000

TABLE_A doesn't have any indexes: any column can be used for filtering, and there is no set of queries that runs much more frequently than the rest.

I also don't use any kind of caching at any level: insert and update operations are not very frequent, but they still happen more often than two queries with the same filter conditions occur, so cached results would rarely be reused.

So every query ends up running a sequential scan. This was not a problem earlier, but now the query run times have become unacceptable (the number of rows in the table has grown a lot).
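
For what it's worth, EXPLAIN ANALYZE on one of the generated queries shows the sequential scan directly; a minimal sketch reusing a subset of the filters from the example above:

EXPLAIN (ANALYZE, BUFFERS)
SELECT 
    COUNT(*)
FROM 
    TABLE_A 
WHERE 
    COLUMN_1 = 'test'
    AND COLUMN_45 > 5
    AND COLUMN_45 < 511;
-- Without any indexes, the plan is expected to contain a "Seq Scan on table_a" node.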

So..

Is there any solution on how to speed up the execution of the task?

Any suggestions would be appreciated. I can change anything except one thing: I have to use SQL. I currently use PostgreSQL, but I would switch to another DBMS if it would help.

Best Answer

I feel your pain, nik. We have a similar use case here, and have been struggling through as best we can with a metric ton of indexes.
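
To make the "metric ton of indexes" concrete: in PostgreSQL that usually means one single-column B-tree index per filterable column, which the planner can combine via bitmap AND scans when several conditions are selective. A rough sketch against the table from the question (the trigram index for the LIKE '%ddd%' filter assumes the pg_trgm extension is available):

-- One B-tree index per commonly filtered column; the planner can
-- combine several of them with a BitmapAnd node.
CREATE INDEX idx_table_a_column_1  ON TABLE_A (COLUMN_1);
CREATE INDEX idx_table_a_column_45 ON TABLE_A (COLUMN_45);
CREATE INDEX idx_table_a_column_98 ON TABLE_A (COLUMN_98);

-- LIKE '%ddd%' cannot use a plain B-tree index; a trigram GIN index can.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_table_a_column_92_trgm
    ON TABLE_A USING gin (COLUMN_92 gin_trgm_ops);

The obvious downside is write amplification and index maintenance on every insert and update, which is exactly why it feels like struggling.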

I know you said you need to stick with SQL, but I really think you might want to consider a column store (https://en.wikipedia.org/wiki/Column-oriented_DBMS), perhaps one SQL-like enough for you to still find workable.

MariaDB seems to be embracing this now too: https://mariadb.com/resources/blog/why-columnstore-important
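
If you want to experiment while staying in SQL, the switch can be as small as the storage engine declaration. A minimal sketch of the same table in MariaDB ColumnStore (column list abbreviated and data types assumed, since the question doesn't give them):

-- Same logical table, stored column-wise: a filtered COUNT(*)
-- only has to read the columns referenced in the WHERE clause.
CREATE TABLE TABLE_A (
    COLUMN_1  VARCHAR(100),
    COLUMN_2  VARCHAR(100),
    COLUMN_45 INT,
    COLUMN_92 VARCHAR(255),
    COLUMN_98 INT
    -- ... remaining columns ...
) ENGINE=ColumnStore;

SELECT COUNT(*)
FROM TABLE_A
WHERE COLUMN_1 = 'test'
  AND COLUMN_45 > 5
  AND COLUMN_45 < 511;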