PostgreSQL – Should I split timestamp parts into separate columns

database-design optimization performance postgresql query-performance timestamp

I am building a PostgreSQL database and I have created a timestamp table, where the primary key is the timestamp itself (e.g. id: Fri Apr 13 2018 15:00:19). The database is supposed to be later migrated to a data warehouse, from which analytics will be extracted.

At this point, I am wondering whether it is beneficial to add extra columns to the timestamp table containing the parsed parts, as in the example below, or to keep a single column with the IDs.

id                       | year | month | day | hour | minutes | seconds
-------------------------------------------------------------------------
Fri Apr 13 2018 15:00:19 | 2018 |   4   | 13  |  15  |    0    |   19


vs


id
-------------------------
Fri Apr 13 2018 15:00:19

My goal is to achieve the best performance possible when querying the data warehouse, so I'm assuming that having the timestamp already split will result in faster queries than extracting the time parts at query time:

SELECT * FROM timestamp_table WHERE year = 2018 /* Querying values already parsed */

vs

SELECT * FROM timestamp_table WHERE EXTRACT(YEAR FROM timestamp_id) = 2018 /* Parsing in real-time */

I would appreciate some best practices input on this.

Best Answer

Keep the timestamp and don't add columns for the parts.

If you need to search for part of a timestamp, you can always create indexes on extract expressions.

Having individual columns wastes space and adds undesirable redundancy for no benefit I can envision.
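As a minimal sketch of what an index on an extract expression might look like, assuming the table from the question is named timestamp_table and its key column id has type timestamp (without time zone); the index name is made up for illustration. With timestamptz the same index is rejected, because extract then depends on the session time zone and is not immutable. The last query is an alternative not mentioned above: a half-open range on the key itself, which can use the primary key index without any extra index.

```sql
-- Assumption: timestamp_table(id timestamp PRIMARY KEY), as in the question.

-- The doubled parentheses are required because EXTRACT(...) is an
-- expression, not a plain column name.
CREATE INDEX timestamp_table_year_idx
    ON timestamp_table ((extract(year FROM id)));

-- The predicate repeats the indexed expression, so the planner can
-- consider the expression index.
SELECT *
FROM timestamp_table
WHERE extract(year FROM id) = 2018;

-- Alternative without an extra index: a half-open range on the key.
SELECT *
FROM timestamp_table
WHERE id >= timestamp '2018-01-01'
  AND id <  timestamp '2019-01-01';
```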

Related Solutions

Should we add extra 5 columns or build a separate table

The decision does not depend on the number of columns you already have, but on whether those 5 columns belong to what you are trying to model.

There is no rule that says: "if you have 10 columns, keep in same table; use a separate table when you reach 36 columns".

If those 5 columns are like a person's street, town, region, and country, then no; they should be in a separate table, "location" or similar. If those 5 columns are like a person's first name, last name, second last name, ID number, and email address, then they should probably stay in the same table.

However, if your table already has 35 columns, then you probably haven't done much normalization to begin with and it looks like you're just stuffing everything on that table...

Update

first we have a table called association then based on that table we have the gps data linked to the association. In addition we also got alert tables linked to the association too. So association is the main table here. Each device also is table by itself but it will be linked to a association when and where it is being used.

…

Basically the association is just like a trip from a to b. So moving from a to b there is a number of gps data which is kept in this main table [?]. If in this trip there are alerts then is kept in the alert table. This 5 new column are indicating the status of the rfid devices which are linked to the main gps device. But in a trip not necessary to have 5 it can be less or even none too.

Assumptions:

For each trip there can be 0..N alerts.
As I understand it, what you call association is the description of a journey.
Multiple RFID devices are used per journey or trip.
A GPS device has 0..N RFID devices associated with it; in other words, each RFID device is associated with one GPS device (not sure I understand the logic of this, though; why are RFID devices used? what's the role of the GPS if you are )
Given that you are using a GPS device, I assume there will be multiple readings during a journey, in combination with RFID readings (?)
When you say "moving from a to b there is a number of gps data which is kept in this main table", I'm not sure I understand what info is stored, or why it is stored in the table called association / journey instead of in some associative table (given that there seems to be "a number of gps data" and not just "one gps datum per journey").

Given this, I would add your 5 columns to the RFIDStatus table, since "This 5 new column are indicating the status of the rfid devices". Below is a first attempt at modeling something similar to what you might be modeling...

[     Journey    ]<——————[  Alert  ]
| fk_from        |       | fk_trip |
| fk_to          |
|                |—(from)—>[    Location    ]
|                |–—(to)—–>| long/lat/name… |
|                |
|                |<—————[    RFIDStatus    ]—————>[RFIDDevice]—————>[ GPSDevice ]
|                |      | fk_rfiddevice    |      | fk_gps   |      |           |
|                |      | fk_journey       |                        |           |
|                |      | datetime         |                        |           |
|                |      | *status details* |                        |           |
|                |<———————————————————————————[ GPSStatus  ]———————>|           |
|                |                            | fk_journey |
|                |                            | fk_gpsdev  |
|                |                            | datetime   |
|                |                            | latitude   |
|                |                            | longitude  |

However, as I mentioned in the comments, it all depends on what you're modeling. Please update your question with more details so other people can give you a better answer. Also, you should read something about database modeling so you start from a good basis and a good understanding of what's needed when modeling a database; it will also help you to ask better questions regarding database design.

PostgreSQL – Perf of query of large table by date

Here's my suggestion, which is somewhat different. I call it "log, aggregate, and snapshot."

Basically all your sensor input is coming in. We can assume or even require that it gets into the database within a specified period of time (1 day?). At the close of each interval we go back one interval into the past and snapshot our analytic info. This way we can start from a snapshot taken before the period we need and roll forward.

For example we might do this in a financial app (not really the same thing but poses similar challenges over time):

-- Journal entries; the account table referenced below is assumed to exist.
CREATE TABLE gl (
   id bigserial not null unique,
   reference text primary key,
   description text,
   date_posted date not null
);
-- Individual debit/credit lines belonging to a journal entry.
CREATE TABLE gl_lines (
   entry_id bigserial,
   account_id int not null references account(id),
   gl_id bigint not null references gl(id),
   amount numeric
);
-- Per-account snapshot taken when the books are closed.
CREATE TABLE eoy_checkpoint (
   date_ending date,
   account_id int references account(id),
   running_balance numeric not null,
   debits numeric,
   credits numeric,
   primary key (date_ending, account_id)
);

Then we can maintain eoy_checkpoint when we close our books. We can also have a trigger that denies entry into gl_lines where it occurs on or before the most recent date in eoy_checkpoint. This allows us to still do aggregate reporting, but we can use the checkpoint as a point we can roll forwards from, and thus manage how much data is actually being aggregated.
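A trigger of the kind described might look like the sketch below. The function and trigger names are hypothetical, and because gl_lines carries no date of its own, the sketch assumes the relevant date is date_posted on the parent gl row. EXECUTE FUNCTION needs PostgreSQL 11 or later; older versions spell it EXECUTE PROCEDURE.

```sql
-- Hypothetical names; assumes the gl, gl_lines and eoy_checkpoint tables above.
CREATE FUNCTION gl_lines_reject_closed() RETURNS trigger
LANGUAGE plpgsql AS $$
DECLARE
    posted date;
    closed date;
BEGIN
    -- Date of the journal entry this line belongs to.
    SELECT date_posted INTO posted FROM gl WHERE id = NEW.gl_id;
    -- Most recent closing date; NULL if the books were never closed.
    SELECT max(date_ending) INTO closed FROM eoy_checkpoint;

    IF closed IS NOT NULL AND posted <= closed THEN
        RAISE EXCEPTION 'gl entry % is dated %, but books are closed through %',
            NEW.gl_id, posted, closed;
    END IF;
    RETURN NEW;
END;
$$;

CREATE TRIGGER gl_lines_closed_period
    BEFORE INSERT OR UPDATE ON gl_lines
    FOR EACH ROW EXECUTE FUNCTION gl_lines_reject_closed();
```

This only guards gl_lines; changing date_posted on an existing gl row would need a similar check on gl.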

I think a similar approach would be usable in your case. It handles this sort of thing very well, and allows a mixture of OLTP and more complex queries to be run without too much impact. It isn't really clear how normalized this is since there is duplication of data, but the data duplication allows you to do things like purge historical data without impacting your current running totals.
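Rolling forward from a checkpoint could be sketched as the query below. It assumes that amount is signed, so that summing it yields a net change, and that running_balance is the balance at the end of date_ending; neither is stated above. The prepared statement name and the parameter values are made up for illustration.

```sql
-- $1 = account id, $2 = date to report as of.
PREPARE balance_as_of (int, date) AS
WITH cp AS (
    -- Latest checkpoint on or before the requested date, if any.
    SELECT date_ending, running_balance
    FROM eoy_checkpoint
    WHERE account_id = $1
      AND date_ending <= $2
    ORDER BY date_ending DESC
    LIMIT 1
)
SELECT coalesce((SELECT running_balance FROM cp), 0)
     + coalesce(sum(l.amount), 0) AS balance
FROM gl_lines l
JOIN gl g ON g.id = l.gl_id
WHERE l.account_id = $1
  AND g.date_posted <= $2
  -- Only lines after the checkpoint are aggregated; with no checkpoint,
  -- everything up to the requested date is summed.
  AND g.date_posted > coalesce((SELECT date_ending FROM cp), '-infinity'::date);

EXECUTE balance_as_of(42, '2018-06-30');
```

However far back the history goes, the query only aggregates the lines posted since the last checkpoint.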
