PostgreSQL – Performance of querying a large table by date

performance, postgresql

I have a table of sensor data that grows by one new measurement per sensor every minute. Something like:

entry_id (pk), sensor_id (fk), measurement_value, created_at

I need to perform some analytics on this data, at hour intervals. Basically the query will look something like:

SELECT measurement_value FROM above_table
WHERE sensor_id = x
AND created_at BETWEEN time_a AND time_a + interval '1 hour'

1) First of all, I'm wondering if this design is reasonable long term. Each sensor will rack up 500k readings per year, so with enough sensors the table will get sizeable very fast. I'd like analytics to stay snappy and not take longer than a second, so perhaps I'll need to be pre-computing these results every few insertions somehow. Alternatively, should I be looking at NoSQL solutions, or would that not really address the core issue?

2) What's the best performance I can get for querying that table? I'm thinking that an index on (sensor_id, created_at) would be hard to beat.
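
For what it's worth, the index I have in mind looks like this (above_table is just the placeholder name from my schema above, and the index name is arbitrary):

-- Composite index matching the WHERE clause: equality on sensor_id,
-- then a range scan on created_at.
CREATE INDEX idx_sensor_created_at ON above_table (sensor_id, created_at);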

Thanks!

Best Answer

Here's my suggestion, which is somewhat different. I call it "log, aggregate, and snapshot."

Basically, all your sensor input keeps coming in, and we can assume (or even require) that it gets into the database within a specified period of time (one day?). At the close of each interval we go back one interval into the past and snapshot our analytic info. That way, a query can start from the snapshot just before the range it needs and roll forward from there.

For example, we might do this in a financial app (not really the same thing, but it poses similar challenges over time):

-- General ledger entry headers; an "account" table is assumed to exist.
CREATE TABLE gl (
   id bigserial not null unique,
   reference text primary key,
   description text,
   date_posted date not null
);

-- Individual lines belonging to each gl entry.
CREATE TABLE gl_lines (
   entry_id bigserial,
   account_id int not null references account(id),
   gl_id bigint not null references gl(id),
   amount numeric
);

-- Year-end snapshot per account, used as the roll-forward point.
CREATE TABLE eoy_checkpoint (
   date_ending date,
   account_id int references account(id),
   running_balance numeric not null,
   debits numeric,
   credits numeric,
   primary key (date_ending, account_id)
);

Then we maintain eoy_checkpoint when we close our books. We can also add a trigger that rejects inserts into gl_lines dated on or before the most recent date in eoy_checkpoint. This still allows aggregate reporting, but the checkpoint gives us a point to roll forward from, and thus lets us control how much data actually has to be aggregated.
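
A sketch of what such a trigger might look like (the function and trigger names here are illustrative, and the posting date is looked up from the referenced gl row, since gl_lines carries no date of its own):

-- Reject lines whose gl entry falls within an already-closed period.
CREATE OR REPLACE FUNCTION deny_closed_period() RETURNS trigger AS $$
BEGIN
   IF (SELECT date_posted FROM gl WHERE id = NEW.gl_id)
      <= (SELECT max(date_ending) FROM eoy_checkpoint) THEN
      RAISE EXCEPTION 'cannot post into a closed period';
   END IF;
   RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER gl_lines_closed_period
   BEFORE INSERT ON gl_lines
   FOR EACH ROW EXECUTE PROCEDURE deny_closed_period();

(While eoy_checkpoint is still empty the comparison yields NULL, so inserts are simply allowed.)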

I think a similar approach would work in your case. It handles this sort of thing very well and lets a mixture of OLTP traffic and more complex queries run without too much impact on each other. It isn't strictly normalized, since the checkpoints duplicate data, but that duplication is what lets you do things like purge historical data without affecting your current running totals.
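
Applied to your table, the checkpoint could be an hourly rollup per sensor. The sketch below uses illustrative names (sensor_hourly_checkpoint, value_sum) and the placeholder table name above_table from your question:

-- Hourly rollup per sensor; this is the roll-forward point.
CREATE TABLE sensor_hourly_checkpoint (
   sensor_id     int       not null,
   hour_start    timestamp not null,
   reading_count bigint    not null,
   value_sum     numeric   not null,
   primary key (sensor_id, hour_start)
);

-- Roll the checkpoint forward: aggregate fully elapsed hours
-- that have not been checkpointed yet.
INSERT INTO sensor_hourly_checkpoint (sensor_id, hour_start, reading_count, value_sum)
SELECT sensor_id,
       date_trunc('hour', created_at),
       count(*),
       sum(measurement_value)
FROM above_table
WHERE date_trunc('hour', created_at) < date_trunc('hour', now())
  AND created_at >= coalesce(
        (SELECT max(hour_start) + interval '1 hour' FROM sensor_hourly_checkpoint),
        '-infinity')
GROUP BY sensor_id, date_trunc('hour', created_at);

Analytics for any closed hour then read a single checkpoint row per sensor, and only the current, still-open hour has to scan the raw table; old raw rows can eventually be purged without touching the rollups.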