I'm looking to import large (100 MB to 1 GB) multi-channel time-series data into a PostgreSQL database. The data comes from EDF format files, which chunk the data into "records" or "epochs" of typically a few seconds each. Each record holds the signals for all channels as sequential arrays of short integers.
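For context, here is a sketch of how one such record might be split into per-channel sample arrays on the client side. The function name and the `samples_per_channel` parameter are my own illustration; the layout itself (each channel's samples stored sequentially, one little-endian 16-bit signed integer per sample) is the EDF convention described above.

```python
import struct

def split_record(record_bytes, samples_per_channel):
    """Split one EDF data record into per-channel sample lists.

    EDF lays a record out as all of channel 0's samples, then all of
    channel 1's, and so on; each sample is a little-endian signed short.
    """
    channels = []
    offset = 0
    for n in samples_per_channel:
        fmt = "<%dh" % n  # n little-endian 16-bit signed integers
        channels.append(list(struct.unpack_from(fmt, record_bytes, offset)))
        offset += 2 * n
    return channels
```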
I'm mandated to store the files within the database, in the worst case as BLOBs. Given that, I'd like to investigate options that would allow me to do something more with the data within the database, such as facilitating queries based upon the signal data.
My initial plan is to store the data as one row per epoch record. What I'm trying to weigh up is whether to store the actual signal data as bytea or smallint[] (or even smallint[][]) columns. Could anyone recommend one over the other? I'm interested in storage and access costs. Usage is likely to be insert once, read occasionally, update never. If one were more easily wrapped up as a custom type, such that I could add functions for analysing or comparing records, then so much the better.
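To make the comparison concrete, here is a sketch of the two forms a channel's samples would take on the client side. The function names are my own; note that a driver such as psycopg2 will normally adapt a Python list to a smallint[] parameter for you, so the text literal below is only for illustration.

```python
import struct

def as_bytea(samples):
    # Raw little-endian shorts, as they would be sent to a bytea column
    return struct.pack("<%dh" % len(samples), *samples)

def as_array_literal(samples):
    # Text form of a one-dimensional smallint[] literal
    return "{%s}" % ",".join(str(s) for s in samples)
```

The bytea form is opaque to the database but compact and trivially reversible; the array form is what makes per-sample access and SQL-level functions over the signal possible.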
No doubt I'm low on detail, so feel free to add comments on what you'd like me to clarify.
Best Answer
In the absence of any answers I've explored the issue further myself.
It looks like user-defined functions can handle all base types, including bytea and smallint[], so this doesn't affect the choice of representation much.

I tried out several different representations on a PostgreSQL 9.4 server running locally on a Windows 7 laptop with a vanilla configuration. The relations to store the actual signal data were as follows:
Large Object for entire file
SMALLINT array per channel
BYTEA per channel in each epoch
SMALLINT 2D array per epoch
BYTEA array per epoch
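As one plausible shape for these relations: the answer doesn't reproduce its DDL, so the column names, key choices, and the name of the per-channel bytea relation below are my own assumptions, not the schemas actually benchmarked.

```python
# Hypothetical DDL for the five test relations; only the relation names
# BlobFile, EpochChannelArray, EpochArray and EpochBytea come from the
# answer itself.
DDL = {
    "BlobFile": """
        CREATE TABLE BlobFile (
            file_id  serial PRIMARY KEY,
            file_oid oid  -- large object holding the entire EDF file
        )""",
    "EpochChannelArray": """
        CREATE TABLE EpochChannelArray (
            file_id int, epoch int, channel int,
            samples smallint[],
            PRIMARY KEY (file_id, epoch, channel)
        )""",
    "EpochChannelBytea": """
        CREATE TABLE EpochChannelBytea (
            file_id int, epoch int, channel int,
            samples bytea,
            PRIMARY KEY (file_id, epoch, channel)
        )""",
    "EpochArray": """
        CREATE TABLE EpochArray (
            file_id int, epoch int,
            samples smallint[][],  -- one row of the array per channel
            PRIMARY KEY (file_id, epoch)
        )""",
    "EpochBytea": """
        CREATE TABLE EpochBytea (
            file_id int, epoch int,
            samples bytea,  -- all channels' shorts concatenated
            PRIMARY KEY (file_id, epoch)
        )""",
}
```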
I then imported a selection of EDF files into each of these relations via Java JDBC and compared the growth in database size after each upload.
In terms of storage cost, relative to the original file size, Large Objects were about 30-35% larger. By contrast, storing each epoch as either a BYTEA or SMALLINT[][] was less than 10% larger. Storing each channel as a separate tuple gave a 40% increase, whether as BYTEA or SMALLINT[], so not much worse than storing as a large object.
One thing I hadn't initially appreciated is that "Multidimensional arrays must have matching extents for each dimension" in PostgreSQL. This means that the SMALLINT[][] representation only works when all channels in an epoch have the same number of samples; hence File C fails to work with the EpochArray relation.

In terms of access costs, I haven't played around with this, but at least for inserting the data initially, the fastest representations were EpochBytea and BlobFile, with EpochChannelArray the slowest, taking about three times as long as the first two.
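Given the matching-extents rule, a simple guard like the following (my own sketch) can decide per epoch whether the 2D-array representation is usable at all:

```python
def epoch_is_rectangular(channels):
    """True if every channel in the epoch has the same sample count,
    i.e. the epoch can be stored as a PostgreSQL smallint[][] value."""
    return len({len(samples) for samples in channels}) <= 1
```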