Postgresql – Measure the size of a PostgreSQL table row

disk-spaceperformancepostgresqlpostgresql-performancesize;

I have a PostgreSQL table. select * is very slow whereas select id is nice and quick. I think it may be that the size of the row is very large and it's taking a while to transport, or it may be some other factor.

I need all of the fields (or nearly all of them), so selecting just a subset isn't a quick fix. Selecting the fields that I want is still slow.

Here's my table schema minus the names:

integer                  | not null default nextval('core_page_id_seq'::regclass)
character varying(255)   | not null
character varying(64)    | not null
text                     | default '{}'::text
character varying(255)   | 
integer                  | not null default 0
text                     | default '{}'::text
text                     | 
timestamp with time zone | 
integer                  | 
timestamp with time zone | 
integer                  |

The size of the text field may be any size. But still, no more than a few kilobytes in the worst case.

Questions

Is there anything about this that screams 'crazy inefficient'?
Is there a way to measure page size at the Postgres command-line to help me debug this?

Best Answer

Q2: `way to measure page size`

PostgreSQL provides a number of Database Object Size Functions. I packed the most interesting ones in this query and added some Statistics Access Functions at the bottom. (The additional module pgstattuple provides more useful functions, yet.)

This is going to show that different methods to measure the "size of a row" lead to very different results. It all depends on what you want to measure, exactly.

This query requires Postgres 9.3 or later. For older versions see below.

Using a VALUES expression in a LATERAL subquery, to avoid spelling out calculations for every row.

Replace public.tbl with your optionally schema-qualified table name to get a compact view of collected row size statistics. You could wrap this into a plpgsql function for repeated use, hand in the table name as parameter and use EXECUTE ...

SELECT l.metric, l.nr AS bytes
     , CASE WHEN is_size THEN pg_size_pretty(nr) END AS bytes_pretty
     , CASE WHEN is_size THEN nr / NULLIF(x.ct, 0) END AS bytes_per_row
FROM  (
   SELECT min(tableoid)        AS tbl      -- = 'public.tbl'::regclass::oid
        , count(*)             AS ct
        , sum(length(t::text)) AS txt_len  -- length in characters
   FROM   public.tbl t                     -- provide table name *once*
   ) x
CROSS  JOIN LATERAL (
   VALUES
     (true , 'core_relation_size'               , pg_relation_size(tbl))
   , (true , 'visibility_map'                   , pg_relation_size(tbl, 'vm'))
   , (true , 'free_space_map'                   , pg_relation_size(tbl, 'fsm'))
   , (true , 'table_size_incl_toast'            , pg_table_size(tbl))
   , (true , 'indexes_size'                     , pg_indexes_size(tbl))
   , (true , 'total_size_incl_toast_and_indexes', pg_total_relation_size(tbl))
   , (true , 'live_rows_in_text_representation' , txt_len)
   , (false, '------------------------------'   , NULL)
   , (false, 'row_count'                        , ct)
   , (false, 'live_tuples'                      , pg_stat_get_live_tuples(tbl))
   , (false, 'dead_tuples'                      , pg_stat_get_dead_tuples(tbl))
   ) l(is_size, metric, nr);

Result:

              metric               | bytes    | bytes_pretty | bytes_per_row
-----------------------------------+----------+--------------+---------------
 core_relation_size                | 44138496 | 42 MB        |            91
 visibility_map                    |        0 | 0 bytes      |             0
 free_space_map                    |    32768 | 32 kB        |             0
 table_size_incl_toast             | 44179456 | 42 MB        |            91
 indexes_size                      | 33128448 | 32 MB        |            68
 total_size_incl_toast_and_indexes | 77307904 | 74 MB        |           159
 live_rows_in_text_representation  | 29987360 | 29 MB        |            62
 ------------------------------    |          |              |
 row_count                         |   483424 |              |
 live_tuples                       |   483424 |              |
 dead_tuples                       |     2677 |              |

For older versions (Postgres 9.2 or older):

WITH x AS (
   SELECT count(*)               AS ct
        , sum(length(t::text))   AS txt_len  -- length in characters
        , 'public.tbl'::regclass AS tbl      -- provide table name as string
   FROM   public.tbl t                       -- provide table name as name
   ), y AS (
   SELECT ARRAY [pg_relation_size(tbl)
               , pg_relation_size(tbl, 'vm')
               , pg_relation_size(tbl, 'fsm')
               , pg_table_size(tbl)
               , pg_indexes_size(tbl)
               , pg_total_relation_size(tbl)
               , txt_len
             ] AS val
        , ARRAY ['core_relation_size'
               , 'visibility_map'
               , 'free_space_map'
               , 'table_size_incl_toast'
               , 'indexes_size'
               , 'total_size_incl_toast_and_indexes'
               , 'live_rows_in_text_representation'
             ] AS name
   FROM   x
   )
SELECT unnest(name)                AS metric
     , unnest(val)                 AS bytes
     , pg_size_pretty(unnest(val)) AS bytes_pretty
     , unnest(val) / NULLIF(ct, 0) AS bytes_per_row
FROM   x, y

UNION ALL SELECT '------------------------------', NULL, NULL, NULL
UNION ALL SELECT 'row_count', ct, NULL, NULL FROM x
UNION ALL SELECT 'live_tuples', pg_stat_get_live_tuples(tbl), NULL, NULL FROM x
UNION ALL SELECT 'dead_tuples', pg_stat_get_dead_tuples(tbl), NULL, NULL FROM x;

Same result.

Q1: `anything inefficient?`

You could optimize column order to save some bytes per row, currently wasted to alignment padding:

integer                  | not null default nextval('core_page_id_seq'::regclass)
integer                  | not null default 0
character varying(255)   | not null
character varying(64)    | not null
text                     | default '{}'::text
character varying(255)   | 
text                     | default '{}'::text
text                     |
timestamp with time zone |
timestamp with time zone |
integer                  |
integer                  |

This saves between 8 and 18 bytes per row. I call it Column Tetris. See:

Also consider:

Would index lookup be noticeably faster with char vs varchar when all values are 36 chars

`DISTINCT ON()`

Just as a side note, this is precisely what DISTINCT ON() does (not to be confused with DISTINCT)

SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. The DISTINCT ON expressions are interpreted using the same rules as for ORDER BY (see above). Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first. For example

So if you were to write,

SELECT myFirstAgg(z)
FROM foo
GROUP BY x,y;

It's effectively

SELECT DISTINCT ON(x,y) z
FROM foo;
-- ORDER BY z;

In that it takes the first z. There are two important differences,

You can also select other columns at no cost of further aggregation..

SELECT DISTINCT ON(x,y) z, k, r, t, v
FROM foo;
-- ORDER BY z, k, r, t, v;

Because there is no GROUP BY you can not use (real) aggregates with it.

CREATE TABLE foo AS
SELECT * FROM ( VALUES
  (1,2,3),
  (1,2,4),
  (1,2,5)
) AS t(x,y,z);

SELECT DISTINCT ON (x,y) z, sum(z)
FROM foo;

-- fails, as you should expect.
SELECT DISTINCT ON (x,y) z, sum(z)
FROM foo;

-- would not otherwise fail.
SELECT myFirstAgg(z), sum(z)
FROM foo
GROUP BY x,y;

Don't forget `ORDER BY`

Also, while I didn't bold it then I will now

Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first. For example

Always use an ORDER BY with DISTINCT ON

Using an Ordered-Set Aggregate Function

I imagine a lot of people are looking for first_value, Ordered-Set Aggregate Functions. Just wanted to throw that out there. It would look like this, if the function existed:

SELECT a, b, first_value() WITHIN GROUP (ORDER BY z)    
FROM foo
GROUP BY a,b;

But, alas you can do this.

SELECT a, b, percentile_disc(0) WITHIN GROUP (ORDER BY z)   
FROM foo
GROUP BY a,b;

Postgresql – Improve performance on concurrent UPDATEs for a timestamp column in Postgres

Would a different kind of column be faster? For example an integer

No. timestamp and timestamptz are just unsigned 64-bit integers internally anyway.

Is there some way to not lock the column?

It doesn't lock the column. It takes weak table lock that doesn't really block anything except DDL, and takes a row level lock on the row you're updating.

There is no way to prevent the row level lock. It exists because without it behaviour and ordering concurrent updates would be undefined. We don't like undefined behaviour in RDBMSs.

It only blocks concurrent updates of the same row anyway.

Any other tips to improve this?

Not with the detail provided. There's likely a better way to do what you're trying to do, but it'll probably involve taking a few steps back and looking for a different strategy for solving the underlying problem.

In the specific case of cache invalidation I think you might want to look into LISTEN and NOTIFY. Again though, there just isn't enough info here to go on.

Questions

Best Answer

Q2: way to measure page size

Q1: anything inefficient?

Related Solutions

Postgresql – In PostgreSQL, is there a type-safe first() aggregate function

DISTINCT ON()

Don't forget ORDER BY