PostgreSQL – Performance of querying a large table by date

performance, postgresql

I have a table of sensor data that grows by one new measurement per sensor every minute. Something like:

entry_id (pk), sensor_id (fk), measurement_value, created_at

I need to perform some analytics on this data, at hour intervals. Basically the query will look something like:

SELECT measurement_value FROM above_table
WHERE sensor_id = x
AND created_at BETWEEN time_a AND time_a + interval '1 hour'

1) First of all, I'm wondering if this design is reasonable long term. Each sensor will rack up 500k readings per year, so with enough sensors the table will get sizeable very fast. I'd like analytics to stay snappy and not take longer than a second, so perhaps I'll need to be pre-computing these results every few insertions somehow. Alternatively, should I be looking at NoSQL solutions, or would that not really address the core issue?

2) What's the best performance I can get for querying that table? I'm thinking that an index on (sensor_id, created_at) would be hard to beat.
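
For what it's worth, the index I have in mind looks like this (above_table is just the placeholder name from my schema above, and the index name is arbitrary):

-- Composite index matching the WHERE clause: equality on sensor_id,
-- then a range scan on created_at.
CREATE INDEX idx_sensor_created_at ON above_table (sensor_id, created_at);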

Thanks!

Best Answer

Here's my suggestion, which is somewhat different. I call it "log, aggregate, and snapshot."

Basically, all your sensor input keeps coming in, and we can assume (or even require) that it gets into the database within a specified period of time (one day?). At the close of each interval we go back one interval into the past and snapshot our analytic info. That way, a query can start from the snapshot just before the range it needs and roll forward from there.

For example, we might do this in a financial app (not really the same thing, but it poses similar challenges over time):

-- General ledger entry headers; an "account" table is assumed to exist.
CREATE TABLE gl (
   id bigserial not null unique,
   reference text primary key,
   description text,
   date_posted date not null
);

-- Individual lines belonging to each gl entry.
CREATE TABLE gl_lines (
   entry_id bigserial,
   account_id int not null references account(id),
   gl_id bigint not null references gl(id),
   amount numeric
);

-- Year-end snapshot per account, used as the roll-forward point.
CREATE TABLE eoy_checkpoint (
   date_ending date,
   account_id int references account(id),
   running_balance numeric not null,
   debits numeric,
   credits numeric,
   primary key (date_ending, account_id)
);

Then we maintain eoy_checkpoint when we close our books. We can also add a trigger that rejects inserts into gl_lines dated on or before the most recent date in eoy_checkpoint. This still allows aggregate reporting, but the checkpoint gives us a point to roll forward from, and thus lets us control how much data actually has to be aggregated.
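
A sketch of what such a trigger might look like (the function and trigger names here are illustrative, and the posting date is looked up from the referenced gl row, since gl_lines carries no date of its own):

-- Reject lines whose gl entry falls within an already-closed period.
CREATE OR REPLACE FUNCTION deny_closed_period() RETURNS trigger AS $$
BEGIN
   IF (SELECT date_posted FROM gl WHERE id = NEW.gl_id)
      <= (SELECT max(date_ending) FROM eoy_checkpoint) THEN
      RAISE EXCEPTION 'cannot post into a closed period';
   END IF;
   RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER gl_lines_closed_period
   BEFORE INSERT ON gl_lines
   FOR EACH ROW EXECUTE PROCEDURE deny_closed_period();

(While eoy_checkpoint is still empty the comparison yields NULL, so inserts are simply allowed.)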

I think a similar approach would work in your case. It handles this sort of thing very well and lets a mixture of OLTP traffic and more complex queries run without too much impact on each other. It isn't strictly normalized, since the checkpoints duplicate data, but that duplication is what lets you do things like purge historical data without affecting your current running totals.
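
Applied to your table, the checkpoint could be an hourly rollup per sensor. The sketch below uses illustrative names (sensor_hourly_checkpoint, value_sum) and the placeholder table name above_table from your question:

-- Hourly rollup per sensor; this is the roll-forward point.
CREATE TABLE sensor_hourly_checkpoint (
   sensor_id     int       not null,
   hour_start    timestamp not null,
   reading_count bigint    not null,
   value_sum     numeric   not null,
   primary key (sensor_id, hour_start)
);

-- Roll the checkpoint forward: aggregate fully elapsed hours
-- that have not been checkpointed yet.
INSERT INTO sensor_hourly_checkpoint (sensor_id, hour_start, reading_count, value_sum)
SELECT sensor_id,
       date_trunc('hour', created_at),
       count(*),
       sum(measurement_value)
FROM above_table
WHERE date_trunc('hour', created_at) < date_trunc('hour', now())
  AND created_at >= coalesce(
        (SELECT max(hour_start) + interval '1 hour' FROM sensor_hourly_checkpoint),
        '-infinity')
GROUP BY sensor_id, date_trunc('hour', created_at);

Analytics for any closed hour then read a single checkpoint row per sensor, and only the current, still-open hour has to scan the raw table; old raw rows can eventually be purged without touching the rollups.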