Postgresql – Difficult categorising of rows

postgresql

First off, I'm not a DBA at all, I'm more of a front-end guy, but I'm currently hitting a brick wall and can't figure out how to tackle it. A database I am working on has a table with several columns. The table represents completed tasks, the columns are:

TaskID             -  -  -  1. On-site
DateStart         |         2. Return-to-base
DateEnd           |         3. Chargeable
Type -  -  -  -  -          4. Complaint
Reference                   5. Rebook
Requester
Approver
Cost
...

Reference is used when the customer provides their own ID for the task, typically this is for example "THR#000123123" or "IX#01212". It is always in the format "ABC#123".

Now, I need to categorise these into multiple arbitrary categories and subcategories based on only the type and reference fields:

Chargeable (type = 3)
1. By company IX (reference like "IX#%")
2. All other Chargeable
Non-chargeable (type <> 3)
1. By company THR (reference like "THR#%")
  1. Return-to-base (type = 2)
  2. On-site (type = 1)
2. Return to base (type = 2)
3. All other non-Chargeable

A task can only belong to one category, so if a task for company THR was "return-to-base", it must only be under category 2.1.1, not 2.2. I realise these categories seem very arbitrary but this structure represents the most useful break-down of almost any selection of tasks from the table.

My question is, is it possible to create a view that contains the TaskID and its category? Or should I just implement this logic in the front-end after selecting every TaskID, Type and Reference?

Best Answer

Ok, so my answer here is conditioned on the following assumptions:

This categorization needs to be relatively stable and
It may be re-used in various ways, and
It needs to perform well.

The obvious, simple answer is to use table methods in PostgreSQL. What you do is create a SQL language function which does not hit the table and returns the value you want. Make it immutable so you can index the output if you want to query against it, etc. Note the specific rules aren't really clear from your post but this should get you started.

CREATE OR REPLACE FUNCTION subcategory(task) -- task is table name
RETURNS text LANGUAGE sql IMMUTABLE AS
$$
    SELECT CASE WHEN $1.type = 3 THEN
                CASE WHEN $I.reference LIKE 'IX#%' 
                     THEN 'Company IX Chargeable'
                     ELSE 'Other Chargeable'
                END
                WHEN $1.reference LIKE 'IX#%' THEN 
                CASE WHEN $1.type = 5 THEN 'Company IX Rebool'
                     ELSE 'Company IX Other'
                END
                WHEN $1.reference LIKE 'THR#%' AND $1.type IN (1, 2) THEN
                CASE WHEN $1.type = 2 THEN 'THR Return to Base'
                     WHEN $1.type = 1 THEN 'THR On Site'
                END
                WHEN $1.type = 2 THEN 'Return to Base'
                ELSE 'Other Nonchargeable'
           END;
$$;

You can then query this using:

SELECT t.subcategory FROM task t;

Note that the table name is non-optional here. The parser converts this to:

SELECT subcategory(t) FROM task t;

You can use this however in any part of the select statement including the where clause. If it proves slow, you can add cost estimates (but I think this should be fast), and you can even index the output using PostgreSQL's functional indexes.

On the performance side, LIKE and substring() don't seem to be significantly different:

postgres=# select count(*) from generate_series(1, 10000000);
  count   
----------
 10000000
(1 row)

Time: 9007.618 ms
postgres=# select count(*) from generate_series(1, 10000000) s WHERE s::text like '1%';
  count  
---------
 1111112
(1 row)

Time: 13653.000 ms
postgres=# select count(*) from generate_series(1, 10000000) s WHERE substring(s::text from 1 for 1) = '1';
  count  
---------
 1111112
(1 row)

Time: 16681.860 ms
postgres=# select count(*) from generate_series(1, 10000000) s WHERE s::text like '1%';
  count  
---------
 1111112
(1 row)

Time: 17163.470 ms
postgres=# select count(*) from generate_series(1, 10000000) s WHERE substring(s::text from 1 for 1) = '1';
  count  
---------
 1111112
(1 row)

Time: 17052.004 ms

So sometimes one is faster than the other, but they don't seem out of line with eachother.

Related Solutions

Postgresql – Select rows with repeat data within certain timeframe

SELECT *
FROM   task t1
WHERE  EXISTS (
   SELECT 1
   FROM   task t2
   WHERE  t2.reference = t1.reference  -- same reference
   AND    t2.task_id  <> t1.task_id    -- different task_id (exclude self-join)
   AND    t2.end_date BETWEEN t1.end_date - 30
                      AND     t1.end_date + 30 
   )

An EXISTS semi-join should be the fastest way to arrive at DISTINCT rows, because you don't get duplicates to begin with. Also, in the event of multiple matching rows (many matching tasks withing 30 days) evaluation can stop after the first match is found. I would expect this query to beat anything you have so far. Test with EXPLAIN ANALYZE.

In a WHERE clause or JOIN condition don't use an expression like:

abs(t1.end_date - t2.end_date) < 30

If you can avoid it. As the left hand expression is derived from values of two tables, the only way for the query planner is to form a limited CROSS JOIN (after applying other conditions) and compute a value for every possible combination. This is a well known anti-pattern for good performance!

As long as there are only few rows per reference, it won't matter much. But the cost grows with O(N²) with more rows per reference. I rewrote the condition to:

t2.end_date BETWEEN t1.end_date - 30 AND t1.end_date + 30

In my tests, the first form went from 5x slower to 500x slower quickly when I narrowed down the id-space for reference (-> more matching rows).

The second form can also more easily use indexes. If you want to select a small sample of the table, with conditions like:

reference > 450
date_end > now()::date

And an index like:

CREATE INDEX test_idx ON task (reference, date_end);
CREATE INDEX test_idx2 ON task (date_end);

The performance gap becomes even more overwhelming.

Postgresql: Return Unique Rows

select *
from the_table
where userid in (select userid
                 from the_table
                 group by userid
                 having count(*) = 1)

Alternatively:

select *
from the_table
   inner join (select userid
               from the_table
               group by userid
               having count(*) = 1) t on t.userid = the_table.userid;

or (possibly more efficient because only a single scan is needed):

select id, data, userid
from (
  select id, 
         data, 
         userid,
         count(*) over (partition by userid) as cnt
  from the_table
) t
where cnt = 1;

Best Answer

Related Solutions

Postgresql – Select rows with repeat data within certain timeframe

Postgresql: Return Unique Rows

Related Question