PostgreSQL to TimescaleDB – How to Convert Table to Hypertable

postgresql timescaledb

I have a PostgreSQL table which I am trying to convert to a TimescaleDB hypertable.

The table looks as follows:

CREATE TABLE public.data
(
    event_time timestamp with time zone NOT NULL,
    pair_id integer NOT NULL,
    entry_id bigint NOT NULL,
    event_data int NOT NULL,
    CONSTRAINT con1 UNIQUE (pair_id, entry_id),
    CONSTRAINT pair_id_fkey FOREIGN KEY (pair_id)
        REFERENCES public.pairs (id) MATCH SIMPLE
        ON UPDATE NO ACTION
        ON DELETE NO ACTION
);

When I attempt to convert this table to a TimescaleDB hypertable using the following command:

SELECT create_hypertable(
        'data',
        'event_time',
        chunk_time_interval => INTERVAL '1 hour',
        migrate_data => TRUE
);

I get the following error:

ERROR: cannot create a unique index without the column "event_time" (used in partitioning)

Question 1: From this post https://stackoverflow.com/questions/55312365/how-to-convert-a-simple-postgresql-table-to-hypertable-or-timescale-db-table-usi my understanding is that this happens because I have specified a unique constraint (con1) which does not contain the column I am partitioning by, event_time. Is that correct?

Question 2: How should I change my table or hypertable to be able to convert it? I have added some details below on how I plan to use the data and on the structure of the data.

Data Properties and usage:

  • There can be multiple entries with the same event_time; those entries have entry_ids that are in sequence
    • This means that if I have 2 entries (event_time 2021-05-18::10:16, id 105, <some_data>) and (event_time 2021-05-18::10:16, id 107, <some_data>), then the entry with id 106 would also have event_time 2021-05-18::10:16
  • The entry_id is not generated by me, and I use the unique constraint con1 to ensure that I am not inserting duplicate data
  • I will query the data mainly on event_time e.g. to create plots and perform other analysis
  • At this point the database contains around 4.6 billion rows but should contain many more soon
  • I would like to take advantage of TimescaleDB's speed and good compression
  • I don't care too much about insert performance

Solutions I have been considering:

  • Pack all the events which have the same timestamp into an array somehow and keep them in one row. I think this would hurt compression and give less flexibility when querying the data, and I would probably end up having to unpack the data on every query.
  • Remove the unique constraint con1, but then how do I ensure that I don't add the same row twice?
  • Expand the unique constraint con1 to include event_time. Would that not decrease performance somewhat, while at the same time opening up the possibility of accidentally inserting 2 rows with the same entry_id and pair_id but different event_time? (I doubt this is likely to happen, though.) A sketch of this option follows.
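To make the third option concrete, this is roughly what I have in mind (keeping the constraint name con1):

-- Replace the two-column constraint with one that includes
-- the partitioning column event_time
ALTER TABLE public.data DROP CONSTRAINT con1;

ALTER TABLE public.data
    ADD CONSTRAINT con1 UNIQUE (pair_id, entry_id, event_time);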

Best Answer

A couple of questions as well as answers here:

  1. It's the unique constraint (con1) that's the problem, not the foreign key constraint, but that was probably a typo.
  2. Yes, it's a problem because the time partitioning column isn't included. The first question I have: entry_id seems like it might be a proxy for time from the other system, so can you just use (pair_id, event_time) as your unique constraint? Will there be multiple records with different entry_ids but the same pair_id and the same event_time? You could run a query with count(*) grouped by pair_id and event_time to figure this out (see the first sketch after this list). Note that a three-column unique constraint could also work; it's a question of how the time is assigned. Is it assigned by you at write time or by the upstream system? If it's the upstream system, then I wouldn't worry too much about duplicates.
  3. The array idea is somewhat similar to compression, but TimescaleDB's native compression uses different, more efficient algorithms than simple array packing (see the compression sketch after this list).
  4. Figuring out a unique constraint that works, either on (pair_id, event_time) or (pair_id, entry_id, event_time), will probably be much more efficient than trying to reduce your data to the unique set at query time, and a three-column unique constraint isn't going to have a huge impact on performance. The ordering of those columns should be chosen based on your query patterns: what sorts of WHERE clauses do you usually have? I.e., on pair_id as well as time, or just on time?
  5. Finally, given the size of your table, I'd highly recommend creating the hypertable without migrating the data: create a separate table, make it a hypertable, then write a small job to migrate the data (a sketch follows). The migrate_data => true option is really meant for much smaller tables; having an outside job do the migration also means you can introduce some parallelism, which will make it go much faster.
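For point 2, a quick sketch of the duplicate check (column names taken from the table definition in the question):

-- Any rows returned here mean (pair_id, event_time) alone is not
-- unique, so entry_id would still be needed in the constraint
SELECT pair_id, event_time, count(*)
FROM public.data
GROUP BY pair_id, event_time
HAVING count(*) > 1
LIMIT 10;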
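For point 3, enabling native compression once the hypertable exists would look roughly like this (TimescaleDB 2.x syntax; segmenting by pair_id and the 7-day policy are assumptions to tune to your workload):

-- Compress, segmenting by pair_id and ordering rows by time within
-- each segment, which suits queries filtered on pair_id and event_time
ALTER TABLE data SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'pair_id',
    timescaledb.compress_orderby = 'event_time DESC'
);

-- Automatically compress chunks older than 7 days
SELECT add_compression_policy('data', INTERVAL '7 days');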
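And for point 5, a minimal sketch of the out-of-band migration (the new table name, constraint name, and batch bounds are illustrative):

-- Create an empty copy of the schema and convert it while it is empty
CREATE TABLE public.data_new (LIKE public.data INCLUDING DEFAULTS);

-- The unique constraint must include the partitioning column;
-- re-add the foreign key to public.pairs here as well
ALTER TABLE public.data_new
    ADD CONSTRAINT data_new_key UNIQUE (pair_id, entry_id, event_time);

SELECT create_hypertable(
        'data_new',
        'event_time',
        chunk_time_interval => INTERVAL '1 hour'
);

-- Copy in time-bounded batches; separate ranges can run in parallel
INSERT INTO public.data_new
SELECT * FROM public.data
WHERE event_time >= '2021-01-01' AND event_time < '2021-02-01';

Once all ranges are copied, the tables can be swapped with ALTER TABLE ... RENAME.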