For testing I created a small sample table:
INSERT INTO daily_data (
company_id
,trade_date
,daily_val
,wh_calc_id
)
SELECT
(random() * 10)::int,
'2010-01-01'::timestamp + (i::text || ' day')::interval,
random() * 100,
(random() * 20) + 230
FROM
generate_series(1, 200000) AS i
;
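The table definition itself isn't shown; here is a minimal sketch consistent with the INSERT above and with the index names that appear in the plans below (the exact column types and the unique constraint are my assumptions):
CREATE TABLE daily_data (
  company_id integer NOT NULL
, trade_date timestamp NOT NULL
, daily_val numeric
, wh_calc_id integer NOT NULL
, UNIQUE (company_id, trade_date, wh_calc_id) -- backs daily_data_company_id_trade_date_wh_calc_id_key
);
CREATE INDEX daily_data_wh_calc_id_idx ON daily_data (wh_calc_id);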
Original query:
EXPLAIN ANALYZE
with dd2 as (select * from daily_data where wh_calc_id = 241 AND company_id < 8
)
SELECT d.*
, ct, ct_lt
, round(CASE WHEN x.ct = 0 THEN numeric '1'
ELSE x.ct_lt / x.ct END, 6) AS pctl_calc
FROM dd2 d, LATERAL (
SELECT count(daily_val) AS ct
, count(daily_val < d.daily_val OR NULL)::numeric AS ct_lt
FROM dd2
WHERE company_id = d.company_id
-- and company_id < 8
AND trade_date < d.trade_date
and wh_calc_id = d.wh_calc_id
-- and wh_calc_id = 241
) x
ORDER BY company_id, trade_date;
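Note the count(daily_val < d.daily_val OR NULL) expression: false OR NULL evaluates to NULL, and count() ignores NULLs, so only rows where the comparison is true are counted. A quick self-contained demonstration:
SELECT count(true OR NULL)          AS ct_true   -- 1
     , count(false OR NULL)         AS ct_false  -- 0
     , count(NULL::boolean OR NULL) AS ct_null;  -- 0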
Subquery version:
EXPLAIN ANALYZE
with dd2 as (
select company_id, trade_date
, (SELECT count(daily_val)
     FROM daily_data
     WHERE company_id = d.company_id AND trade_date < d.trade_date AND wh_calc_id = d.wh_calc_id) AS ct
, (SELECT count(daily_val < d.daily_val OR NULL)::numeric
     FROM daily_data
     WHERE company_id = d.company_id AND trade_date < d.trade_date AND wh_calc_id = d.wh_calc_id) AS ct_lt
from daily_data AS d where wh_calc_id = 241 AND company_id < 8
)
SELECT d.*
, round(CASE WHEN d.ct = 0 THEN numeric '1'
ELSE d.ct_lt / d.ct END, 6) AS pctl_calc
FROM dd2 d
ORDER BY company_id, trade_date;
Partial window + partial subquery version:
EXPLAIN ANALYZE
with dd2 as (select *,
count(daily_val) OVER (PARTITION BY company_id, wh_calc_id ORDER BY trade_date) - 1 AS ct,
(SELECT count(daily_val < d.daily_val OR NULL)::numeric
   FROM daily_data
   WHERE company_id = d.company_id AND trade_date < d.trade_date AND wh_calc_id = d.wh_calc_id) AS ct_lt
from daily_data AS d where wh_calc_id = 241 AND company_id < 8
)
SELECT d.*
, round(CASE WHEN ct = 0 THEN numeric '1'
ELSE ct_lt / ct END, 6) AS pctl_calc
FROM dd2 d
ORDER BY company_id, trade_date;
Pure window version (slightly tricky):
EXPLAIN ANALYZE
with dd2 as (select d.company_id, d.trade_date, d.daily_val,
count(daily_val) OVER w - 1 AS ct,
array_agg(daily_val) OVER w AS ct_lt2
from daily_data AS d
where wh_calc_id = 241 AND company_id < 8
WINDOW w AS (PARTITION BY company_id, wh_calc_id ORDER BY trade_date)
) , dd1 AS (
SELECT d.*
, (SELECT count(*) FILTER (WHERE u < d.daily_val) FROM unnest(ct_lt2) AS u) AS ct_lt
FROM dd2 d
) SELECT *
, round(CASE WHEN ct = 0 THEN numeric '1'
ELSE ct_lt / ct END, 6) AS pctl_calc
FROM dd1 AS d
ORDER BY company_id, trade_date;
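A note on count(daily_val) OVER w - 1 (used in both window variants): with ORDER BY in the window definition, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so the running count includes the current row itself; subtracting 1 leaves the count of earlier rows. (RANGE mode also counts peers with the same trade_date, which only matters if dates can repeat within a partition.) A minimal demonstration with toy data:
SELECT v
     , count(*) OVER (ORDER BY v) AS running_ct -- includes the current row
FROM (VALUES (1), (2), (3)) AS t(v);
-- v | running_ct
-- 1 | 1
-- 2 | 2
-- 3 | 3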
Original query:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=2653.28..2653.84 rows=226 width=132) (actual time=8399.345..8399.886 rows=7540 loops=1)
Sort Key: d.company_id, d.trade_date
Sort Method: quicksort Memory: 1253kB
CTE dd2
-> Bitmap Heap Scan on daily_data (cost=21.55..1224.03 rows=226 width=60) (actual time=1.026..4.794 rows=7540 loops=1)
Recheck Cond: (wh_calc_id = 241)
Filter: (company_id < 8)
Rows Removed by Filter: 2526
Heap Blocks: exact=1470
-> Bitmap Index Scan on daily_data_wh_calc_id_idx (cost=0.00..21.50 rows=677 width=0) (actual time=0.848..0.848 rows=10066 loops=1)
Index Cond: (wh_calc_id = 241)
-> Nested Loop (cost=6.22..1420.41 rows=226 width=132) (actual time=6.196..8391.961 rows=7540 loops=1)
-> CTE Scan on dd2 d (cost=0.00..4.52 rows=226 width=60) (actual time=1.029..1.933 rows=7540 loops=1)
-> Aggregate (cost=6.22..6.24 rows=1 width=40) (actual time=1.111..1.111 rows=1 loops=7540)
-> CTE Scan on dd2 (cost=0.00..6.22 rows=1 width=32) (actual time=0.003..0.995 rows=484 loops=7540)
Filter: ((trade_date < d.trade_date) AND (company_id = d.company_id) AND (wh_calc_id = d.wh_calc_id))
Rows Removed by Filter: 7056
Planning time: 0.345 ms
Execution time: 8400.233 ms
(19 rows)
Subquery version:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=9949.69..9950.25 rows=226 width=80) (actual time=10558.505..10559.017 rows=7540 loops=1)
Sort Key: d.company_id, d.trade_date
Sort Method: quicksort Memory: 782kB
CTE dd2
-> Bitmap Heap Scan on daily_data d_1 (cost=21.55..9934.07 rows=226 width=48) (actual time=1.138..10538.523 rows=7540 loops=1)
Recheck Cond: (wh_calc_id = 241)
Filter: (company_id < 8)
Rows Removed by Filter: 2526
Heap Blocks: exact=1470
-> Bitmap Index Scan on daily_data_wh_calc_id_idx (cost=0.00..21.50 rows=677 width=0) (actual time=0.945..0.945 rows=10066 loops=1)
Index Cond: (wh_calc_id = 241)
SubPlan 1
-> Aggregate (cost=19.26..19.27 rows=1 width=8) (actual time=0.722..0.722 rows=1 loops=7540)
-> Index Scan using daily_data_company_id_trade_date_wh_calc_id_key on daily_data (cost=0.42..19.26 rows=1 width=32) (actual time=0.007..0.671 rows=484 loops=7540)
Index Cond: ((company_id = d_1.company_id) AND (trade_date < d_1.trade_date) AND (wh_calc_id = d_1.wh_calc_id))
SubPlan 2
-> Aggregate (cost=19.26..19.27 rows=1 width=32) (actual time=0.672..0.672 rows=1 loops=7540)
-> Index Scan using daily_data_company_id_trade_date_wh_calc_id_key on daily_data daily_data_1 (cost=0.42..19.26 rows=1 width=32) (actual time=0.006..0.551 rows=484 loops=7540)
Index Cond: ((company_id = d_1.company_id) AND (trade_date < d_1.trade_date) AND (wh_calc_id = d_1.wh_calc_id))
-> CTE Scan on dd2 d (cost=0.00..6.78 rows=226 width=80) (actual time=1.142..10552.746 rows=7540 loops=1)
Planning time: 0.248 ms
Execution time: 10559.387 ms
(22 rows)
Partial window + partial subquery version:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=5609.72..5610.28 rows=226 width=132) (actual time=4931.963..4932.433 rows=7540 loops=1)
Sort Key: d.company_id, d.trade_date
Sort Method: quicksort Memory: 1253kB
CTE dd2
-> WindowAgg (cost=1232.87..5594.10 rows=226 width=100) (actual time=9.498..4914.892 rows=7540 loops=1)
-> Sort (cost=1232.87..1233.43 rows=226 width=60) (actual time=9.468..11.291 rows=7540 loops=1)
Sort Key: d_1.company_id, d_1.trade_date
Sort Method: quicksort Memory: 782kB
-> Bitmap Heap Scan on daily_data d_1 (cost=21.55..1224.03 rows=226 width=60) (actual time=1.002..5.251 rows=7540 loops=1)
Recheck Cond: (wh_calc_id = 241)
Filter: (company_id < 8)
Rows Removed by Filter: 2526
Heap Blocks: exact=1470
-> Bitmap Index Scan on daily_data_wh_calc_id_idx (cost=0.00..21.50 rows=677 width=0) (actual time=0.814..0.814 rows=10066 loops=1)
Index Cond: (wh_calc_id = 241)
SubPlan 1
-> Aggregate (cost=19.26..19.27 rows=1 width=32) (actual time=0.649..0.649 rows=1 loops=7540)
-> Index Scan using daily_data_company_id_trade_date_wh_calc_id_key on daily_data (cost=0.42..19.26 rows=1 width=32) (actual time=0.005..0.529 rows=484 loops=7540)
Index Cond: ((company_id = d_1.company_id) AND (trade_date < d_1.trade_date) AND (wh_calc_id = d_1.wh_calc_id))
-> CTE Scan on dd2 d (cost=0.00..6.78 rows=226 width=132) (actual time=9.504..4928.264 rows=7540 loops=1)
Planning time: 0.223 ms
Execution time: 4932.854 ms
(22 rows)
Pure window version (slightly tricky):
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=1601.04..1601.61 rows=226 width=120) (actual time=2006.537..2018.331 rows=7540 loops=1)
Sort Key: d.company_id, d.trade_date
Sort Method: quicksort Memory: 63526kB
CTE dd2
-> WindowAgg (cost=1232.87..1239.08 rows=226 width=84) (actual time=8.298..86.164 rows=7540 loops=1)
-> Sort (cost=1232.87..1233.43 rows=226 width=44) (actual time=8.285..9.752 rows=7540 loops=1)
Sort Key: d_1.company_id, d_1.trade_date
Sort Method: quicksort Memory: 782kB
-> Bitmap Heap Scan on daily_data d_1 (cost=21.55..1224.03 rows=226 width=44) (actual time=0.935..5.078 rows=7540 loops=1)
Recheck Cond: (wh_calc_id = 241)
Filter: (company_id < 8)
Rows Removed by Filter: 2526
Heap Blocks: exact=1470
-> Bitmap Index Scan on daily_data_wh_calc_id_idx (cost=0.00..21.50 rows=677 width=0) (actual time=0.722..0.722 rows=10066 loops=1)
Index Cond: (wh_calc_id = 241)
CTE dd1
-> CTE Scan on dd2 d_2 (cost=0.00..346.34 rows=226 width=88) (actual time=8.315..1573.920 rows=7540 loops=1)
SubPlan 2
-> Aggregate (cost=1.50..1.51 rows=1 width=8) (actual time=0.155..0.156 rows=1 loops=7540)
-> Function Scan on unnest u (cost=0.00..1.00 rows=100 width=32) (actual time=0.034..0.058 rows=485 loops=7540)
-> CTE Scan on dd1 d (cost=0.00..6.78 rows=226 width=120) (actual time=8.321..1774.450 rows=7540 loops=1)
Planning time: 0.177 ms
Execution time: 2443.855 ms
(23 rows)
Conclusion
The window version is the best. I found no easy way to express the ct_lt calculation as a pure window function. As a workaround I aggregate the values into an array and compute ct_lt from that array. This probably incurs some penalty for large amounts of data, but even so, this version beats the others.
UPD:
Adapted for PostgreSQL 9.4:
EXPLAIN ANALYZE
with dd2 as (select d.company_id, d.trade_date, d.daily_val,
count(daily_val) OVER w - 1 AS ct,
array_agg(daily_val) OVER w AS ct_lt2
from daily_data AS d
where wh_calc_id = 241 AND company_id < 8
WINDOW w AS (PARTITION BY company_id, wh_calc_id ORDER BY trade_date)
) , dd1 AS (
SELECT d.*
, (SELECT count(*) FROM unnest(ct_lt2) AS u WHERE u < d.daily_val) AS ct_lt
FROM dd2 d
) SELECT *
, round(CASE WHEN ct = 0 THEN numeric '1'
ELSE ct_lt::numeric / ct END, 6) AS pctl_calc
FROM dd1 AS d
ORDER BY company_id, trade_date;
Index:
CREATE INDEX daily_data_i1_idx
ON daily_data
(wh_calc_id, company_id, trade_date);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=1311.98..1312.82 rows=333 width=120) (actual time=1597.558..1613.481 rows=7407 loops=1)
Sort Key: d.company_id, d.trade_date
Sort Method: quicksort Memory: 61659kB
CTE dd2
-> WindowAgg (cost=824.34..833.50 rows=333 width=84) (actual time=10.416..87.668 rows=7407 loops=1)
-> Sort (cost=824.34..825.17 rows=333 width=44) (actual time=10.381..11.851 rows=7407 loops=1)
Sort Key: d_1.company_id, d_1.trade_date
Sort Method: quicksort Memory: 771kB
-> Bitmap Heap Scan on daily_data d_1 (cost=11.83..810.39 rows=333 width=44) (actual time=2.934..6.548 rows=7407 loops=1)
Recheck Cond: ((wh_calc_id = 241) AND (company_id < 8))
Heap Blocks: exact=1460
-> Bitmap Index Scan on daily_data_i1_idx (cost=0.00..11.75 rows=333 width=0) (actual time=2.661..2.661 rows=7407 loops=1)
Index Cond: ((wh_calc_id = 241) AND (company_id < 8))
CTE dd1
-> CTE Scan on dd2 d_2 (cost=0.00..454.55 rows=333 width=88) (actual time=10.442..1320.450 rows=7407 loops=1)
SubPlan 2
-> Aggregate (cost=1.33..1.34 rows=1 width=8) (actual time=0.143..0.143 rows=1 loops=7407)
-> Function Scan on unnest u (cost=0.00..1.25 rows=33 width=0) (actual time=0.036..0.124 rows=234 loops=7407)
Filter: (u < d_2.daily_val)
Rows Removed by Filter: 243
-> CTE Scan on dd1 d (cost=0.00..9.99 rows=333 width=120) (actual time=10.450..1468.533 rows=7407 loops=1)
Planning time: 0.471 ms
Execution time: 1960.835 ms
Assumptions
agency has fewer rows than the "millions and billions" you mention for other tables. Way below the range of integer: -2147483648 to +2147483647. Else we need bigint for internal_id to begin with.
But agency is still big. Else, don't bother with the index optimizations below.
Both internal_id and external_id hardly ever change.
ID values are roughly evenly distributed. Not a few extremely common agencies and many very rare ones. (That might favor query optimization without key translation.)
I would consider a combination of scenario 1 & 2, using this query style:
SELECT *
FROM A
WHERE internal_id = (SELECT internal_id FROM agency WHERE external_id = 5);
The subquery encapsulates the key translation and can be used as a drop-in replacement for providing a literal internal_id. It also makes the job of the query planner a bit simpler when many joins are involved.
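For comparison, the same lookup written as an explicit join (equivalent result, assuming A carries internal_id as in the question, but one more join for the planner to reorder):
SELECT a.*
FROM A a
JOIN agency USING (internal_id)
WHERE agency.external_id = 5;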
Unless you reuse internal_id for many subsequent queries, a separate lookup needlessly adds the cost of a separate round trip to the server.
You might encapsulate the key translation in a simple SQL function:
CREATE FUNCTION public.f_ext2int(_external_id bigint)
RETURNS int
LANGUAGE sql STABLE PARALLEL SAFE AS
'(SELECT internal_id FROM public.agency WHERE external_id = _external_id)';
Then the above query becomes:
SELECT * FROM A WHERE internal_id = f_ext2int(5);
The function can be "inlined" by the query planner; a simple LANGUAGE sql function like this one, consisting of a single SELECT, typically qualifies.
I suggest this table definition:
CREATE TABLE agency (
internal_id integer -- PK is NOT NULL implicitly
, external_id bigint NOT NULL -- NOT NULL, right?
-- , name, location, created_at, ...
, PRIMARY KEY (internal_id) INCLUDE (external_id)
, UNIQUE (external_id) INCLUDE (internal_id)
);
This provides the crucial indices on (internal_id, external_id) and (external_id, internal_id) and enforces the constraints you mentioned, without redundant indices.
The second one (UNIQUE (external_id) INCLUDE (internal_id)) is intended for reverse lookups. Seems likely you need that, too. Else you can skip the INCLUDE clause there. Why do we need both indices? Because an index can only be searched efficiently on its leading column(s), so each lookup direction needs an index starting with the column being looked up.
The table definition makes heavy use of covering indices (Postgres 11 or later).
Among other things, covering indices negate the ballast of additional columns in agency for the purpose of key translation.
With these indices in place, key translation boils down to very fast index-only scans. The cost will be practically negligible in the context of queries on your huge tables.
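If you also need the translation in the opposite direction (internal → external), a matching counterpart follows the same pattern and is served by the covering primary key index. The function name f_int2ext is my invention; a minimal sketch:
CREATE FUNCTION public.f_int2ext(_internal_id int)
RETURNS bigint
LANGUAGE sql STABLE PARALLEL SAFE AS
'(SELECT external_id FROM public.agency WHERE internal_id = _internal_id)';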
This saves "millions and billions" times 4 bytes for every additional table and index (which may matter a lot more). True, storage is getting cheaper all the time, but RAM (and fast cache memory!) is still typically limited. Bigger tables and indices mean that less of the data can stay in cache. And that's crucial for performance.
Wider rows always affect overall performance of the database more or less negatively, even with cheap storage.
And it's typically much easier on the human eye to operate with smaller integer numbers in the many tables (and log files, and debugging, ...). That may even be the most important practical benefit.
Best Answer
No, this shouldn't affect the speed of queries against other tables.
One way it could indirectly affect speed is if you regularly run queries against that big table and retrieve many rows from it (either directly, or indirectly, e.g. because of a Seq Scan). That can cause data from other (smaller) tables to be evicted from the cache.
A query against one of the other tables then has to read its data from disk again, which is of course slower than reading it from cache.
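If you suspect this effect, the pg_buffercache extension shows which relations currently occupy shared buffers. The following query is adapted from the example in the PostgreSQL documentation:
CREATE EXTENSION IF NOT EXISTS pg_buffercache;
-- Top 10 relations by number of cached buffers in the current database:
SELECT n.nspname, c.relname, count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
   AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                             WHERE datname = current_database()))
JOIN pg_namespace n ON n.oid = c.relnamespace
GROUP BY n.nspname, c.relname
ORDER BY 3 DESC
LIMIT 10;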