PostgreSQL – Optimizing a table for time-series data

postgresql | postgresql-performance | timescaledb

I have the table below, which stores a time-series result. A row is only relevant when signal is true. When signal is false, it simply records that we received a result for that particular timestamp but it is not a valid one, so res and the other columns contain null values. When signal is null, it means we have yet to receive a result for that timestamp. The signal is very sparse: it is true for less than roughly 7% of the records. Also, inserts into this table are not ordered by timestamp; rows for older dates can arrive later.

CREATE TABLE public.res
(
    pid integer NOT NULL,
    aid integer NOT NULL,
    cid integer NOT NULL,
    "time" timestamp without time zone NOT NULL,
    signal boolean,
    price numeric,
    res double precision[] NOT NULL,
    ...<Many more columns of numeric/numeric array data types>
    CONSTRAINT res_pkey PRIMARY KEY (pid, aid, cid, "time")
)

This table can contain millions of records and keeps growing as my database grows. I want to optimize it, so I have the following questions:

  • Is each row the same size? Or, since a row only makes sense when signal is true, can it be dynamically sized, keeping the overall size of the table low? A minimum row would contain (pid, aid, cid, time, signal, price) and a maximum row would additionally contain res and the remaining columns. Is it possible to do this in Postgres with its data types? I do not want to create separate tables, because run-time joins could be very expensive when there are millions of records (see the sketch after this list for the kind of split I mean).

  • This SO answer says "In effect NULL storage is absolutely free for tables up to 8 columns.", but my table has many more than 8 columns. Does that still hold?

  • Any other suggestions you might have for dealing with this kind of problem?

  • I have read about TimescaleDB, but since the records are not inserted in timestamp order, does it have any advantage over plain Postgres in this use case?
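
For clarity, this is roughly the kind of two-table split I am trying to avoid (table and constraint names here are made up for illustration): a narrow table that always gets a row, and a wide table that only gets a row when signal is true, joined back on the primary key at query time.

-- Narrow table: one row per (pid, aid, cid, time), regardless of signal.
CREATE TABLE public.res_head
(
    pid integer NOT NULL,
    aid integer NOT NULL,
    cid integer NOT NULL,
    "time" timestamp without time zone NOT NULL,
    signal boolean,
    price numeric,
    CONSTRAINT res_head_pkey PRIMARY KEY (pid, aid, cid, "time")
);

-- Wide table: only populated when signal is true, holding res and the remaining columns.
CREATE TABLE public.res_detail
(
    pid integer NOT NULL,
    aid integer NOT NULL,
    cid integer NOT NULL,
    "time" timestamp without time zone NOT NULL,
    res double precision[] NOT NULL,
    -- ...remaining numeric / numeric array columns...
    CONSTRAINT res_detail_pkey PRIMARY KEY (pid, aid, cid, "time"),
    CONSTRAINT res_detail_head_fkey FOREIGN KEY (pid, aid, cid, "time")
        REFERENCES public.res_head (pid, aid, cid, "time")
);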

Thanks

Best Answer

If you were seriously considering switching database systems, I could tell you that Microsoft SQL Server already has an out-of-the-box feature, sparse columns, that fits your use case well. But my recommendation would be not to change database systems just to optimize data storage. PostgreSQL is a very capable database system itself.

To answer your question about NULLs: PostgreSQL tracks them in a per-row null bitmap with one bit per column, so a NULL value takes no data space at all; only its bit in the bitmap marks it as absent. The "absolutely free for tables up to 8 columns" figure comes from the 23-byte tuple header being padded to 24 bytes, which leaves room for a one-byte bitmap (8 columns) at no extra cost. Beyond that, the bitmap needs one byte per 8 columns, but since the header is padded to an 8-byte boundary the practical overhead grows only in 8-byte steps (a 32-byte header covers anywhere from 9 to 72 columns). Either way the cost is tiny, so PostgreSQL already reflects your sparseness very efficiently on disk.
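
If you want to verify this on your own data, one rough way (just a sketch using built-in functions, nothing specific to your schema beyond the table name) is to compare average tuple sizes per signal state. pg_column_size() on a whole-row reference reports approximately what each row occupies, so the signal = false and signal IS NULL groups should come out far smaller than the signal = true group:

-- Approximate per-row storage, grouped by signal state.
-- pg_column_size(r.*) measures the row as a composite value, which is close to
-- (though not exactly) the on-disk tuple size, so treat the numbers as a rough guide.
SELECT signal,
       count(*)                        AS row_count,
       round(avg(pg_column_size(r.*))) AS avg_row_bytes
FROM   public.res AS r
GROUP  BY signal
ORDER  BY signal;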