PostgreSQL – Performance Implications of Dropping Unused Surrogate Key Column

performance postgresql postgresql-9.5 postgresql-performance

I have a very large (>10m rows) table in a Postgres 9.5.4 database that collects readings from sensors. This table handles many inserts (1k-2k/second) but isn't queried often: only on a cache miss or when rebuilding the cache. The table is essentially ‘append-only’: there are no updates. However, the queries I do run are often fairly complex, and I care about optimizing read performance, as a cache miss can be very slow for the user.

In the relevant business context, sensors collect energy and power-quality data from buildings; there can be multiple sensors per building, and a common query aggregates data at the facility level. The table schema looks like this:

                                   Table "public.measurement"
    Column     |            Type             |                        Modifiers                         
---------------+-----------------------------+----------------------------------------------------------
 id            | integer                     | not null default nextval('measurement_id_seq'::regclass)
 meter_id      | integer                     | 
 timestamp     | timestamp without time zone | 
 value         | double precision            | 
 _reading_type | smallint                    | 
 facility_id   | integer                     | 
Indexes:
    "measurement_pkey" PRIMARY KEY, btree (id)
    "ix_measurement_timestamp_meter_id__reading_type" btree ("timestamp", meter_id, _reading_type)
    "ix_measurement_timestamp_facility_id__reading_type" btree ("timestamp", facility_id, _reading_type)
Foreign-key constraints:
    "measurement_meter_id_fkey" FOREIGN KEY (meter_id) REFERENCES meter(id)
    "measurement_facility_id_fkey" FOREIGN KEY (facility_id) REFERENCES facility(id)

I am considering dropping the id column, which is a meaningless ‘surrogate key’, as I never, ever query it, but I'm curious what the performance implications would be.

I essentially run two different types of queries on this table (both are sketched in SQL below):

  • "Give me all of the measurements between time start and time end of _reading_type 0 for meter_id X", or
  • "Give me all of the measurements between time start and time end of _reading_type 0 for facility_id X" (I believe the facility_id column denormalizes the table, but saves what was an expensive join).

Questions

  • What would happen if I do not declare a primary key?

  • The combination of (timestamp, meter_id, _reading_type) is guaranteed to be unique, so: what would be the implication of dropping the multicolumn index on these columns and defining a composite primary key?

Best Answer

What would happen if there is no primary key? The combination of timestamp-meter_id-_reading_type is guaranteed to be unique; what would be the implication of dropping the multicolumn index on these columns and making a composite primary key?

You don't need a surrogate key, but you can still use a PRIMARY KEY. It sounds to me like the Primary Key here is

meter_id, _reading_type, timestamp
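If that combination really is unique (and never NULL), a minimal sketch of the switch, using the names from the schema above, would be:

    -- ADD PRIMARY KEY takes an ACCESS EXCLUSIVE lock and scans the
    -- whole table to build the new index and verify uniqueness.
    ALTER TABLE measurement
        DROP COLUMN id,   -- also drops measurement_pkey automatically
        ADD PRIMARY KEY (meter_id, _reading_type, "timestamp");

With the equality columns leading and the range column ("timestamp") last, the primary key's index also serves the per-meter query, so ix_measurement_timestamp_meter_id__reading_type could likely be dropped afterwards.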

Also, you have two indexes on the same columns; drop one of them:

"ix_measurement_timestamp_meter_id__reading_type"    btree ("timestamp", meter_id, _reading_type)
"ix_measurement_timestamp_facility_id__reading_type" btree ("timestamp", meter_id, _reading_type)

The name says facility_id but the columns say meter_id, so that's got to be a typo in the post; if the duplicate index really exists, though, it's doing nothing but slowing down your inserts. At 10 million rows and growing, you may also want to look at declarative partitioning in the upcoming PostgreSQL 10, or inheritance if you can't upgrade.
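If the duplicate really exists, the cleanup is one statement, and the facility index can be recreated on the column its name implies (names taken from the schema above):

    DROP INDEX ix_measurement_timestamp_facility_id__reading_type;
    CREATE INDEX ix_measurement_timestamp_facility_id__reading_type
        ON measurement ("timestamp", facility_id, _reading_type);

On 9.5, partitioning means inheritance: child tables with CHECK constraints plus a routing trigger. A rough sketch for one monthly child (the table name, month range, and index are illustrative):

    CREATE TABLE measurement_y2016m09 (
        CHECK ("timestamp" >= DATE '2016-09-01'
           AND "timestamp" <  DATE '2016-10-01')
    ) INHERITS (measurement);

    -- Children don't inherit the parent's indexes; create them per child.
    CREATE INDEX ON measurement_y2016m09 (meter_id, _reading_type, "timestamp");

    CREATE OR REPLACE FUNCTION measurement_insert_trigger() RETURNS trigger AS $$
    BEGIN
        -- Route each row to the child whose CHECK constraint it satisfies.
        IF NEW."timestamp" >= DATE '2016-09-01'
           AND NEW."timestamp" < DATE '2016-10-01' THEN
            INSERT INTO measurement_y2016m09 VALUES (NEW.*);
        ELSE
            RAISE EXCEPTION 'no partition for timestamp %', NEW."timestamp";
        END IF;
        RETURN NULL;  -- the row is not stored in the parent table
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER measurement_insert
        BEFORE INSERT ON measurement
        FOR EACH ROW EXECUTE PROCEDURE measurement_insert_trigger();

With constraint_exclusion enabled, the planner then skips children whose CHECK ranges fall outside a query's timestamp filter.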

This table handles many inserts (1k-2k/second)

You're also not likely doing 2,000 INSERTs a second: that would be 7.2 MILLION an hour, which would blow past your 10 million rows in under 90 minutes. Unless your table gets truncated every 1.5 hours, the real rate must be far lower.
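If in doubt, the cumulative insert counter makes the real rate easy to measure: sample it twice, a known interval apart, and divide the difference by the elapsed seconds (n_tup_ins counts rows inserted since the last statistics reset):

    SELECT n_tup_ins
    FROM   pg_stat_user_tables
    WHERE  relname = 'measurement';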