PostgreSQL bytea vs smallint[]

blob, bytea, datatypes, postgresql

I'm looking to import large (100 MB to 1 GB) multi-channel time-series data into a PostgreSQL database. The data comes from EDF-format files, which chunk the data into "records" or "epochs" of typically a few seconds each. Each epoch's record holds the signals for all channels as sequential arrays of short integers.

I'm mandated to store the files within the database, in the worst case as BLOBs. Given that, I'd like to investigate options that would allow me to do something more with the data within the database, such as facilitating queries based upon the signal data.

My initial plan is to store the data as one row per epoch record. What I'm trying to weigh up is whether to store the actual signal data as bytea or smallint[] (or even smallint[][]) types. Could anyone recommend one over the other? I'm interested in storage and access costs. Usage is likely to be insert once, read occasionally, update never. If one were more easily wrapped up as a custom type, such that I could add functions for analysing or comparing records, then so much the better.

No doubt I'm low on detail, so feel free to add comments on what you'd like me to clarify.

Best Answer

In the absence of any answers I've explored the issue further myself.

It looks like user-defined functions can handle all base types, including bytea and smallint[], so this doesn't affect the choice of representation much.
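For instance, a minimal sketch of such a function over SMALLINT[] (the name and body are my own illustration, not part of the schemas below):

CREATE FUNCTION signal_mean(signal SMALLINT[]) RETURNS NUMERIC AS $$
    -- Unnest the array into a set of samples and average them
    SELECT avg(s) FROM unnest(signal) AS s;
$$ LANGUAGE SQL IMMUTABLE;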

I tried out several different representations on a PostgreSQL 9.4 server running locally on a Windows 7 laptop with a vanilla configuration. The relations used to store the actual signal data were as follows.

Large Object for entire file

CREATE TABLE BlobFile (
    eeg_id INTEGER PRIMARY KEY,
    eeg_oid OID NOT NULL
);
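For illustration, a row could be populated server-side with lo_import (the file path here is hypothetical):

-- lo_import reads from the server's filesystem and returns the new object's OID;
-- a client-side load would go through the large object API instead
INSERT INTO BlobFile (eeg_id, eeg_oid)
VALUES (1, lo_import('/path/to/recording.edf'));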

SMALLINT array per channel

CREATE TABLE EpochChannelArray (
    eeg_id INT NOT NULL,
    epoch INT NOT NULL,
    channel INT,
    signal SMALLINT[] NOT NULL,
    PRIMARY KEY (eeg_id, epoch, channel)
);
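One attraction of this form is that slices of a signal can be queried directly in SQL, e.g. the first 100 samples of one channel (the key values are illustrative):

SELECT signal[1:100]
FROM EpochChannelArray
WHERE eeg_id = 1 AND epoch = 42 AND channel = 3;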

BYTEA per channel in each epoch

CREATE TABLE EpochChannelBytea (
    eeg_id INT NOT NULL,
    epoch INT NOT NULL,
    channel INT,
    signal BYTEA NOT NULL,
    PRIMARY KEY (eeg_id, epoch, channel)
);
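With BYTEA, individual samples have to be reassembled from byte pairs; a sketch, assuming little-endian 16-bit samples as in EDF (sign handling for negative values omitted):

-- Sample i occupies bytes 2*i and 2*i+1; shown here for i = 0
SELECT get_byte(signal, 0) | (get_byte(signal, 1) << 8) AS sample_0
FROM EpochChannelBytea
WHERE eeg_id = 1 AND epoch = 42 AND channel = 3;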

SMALLINT 2D array per epoch

CREATE TABLE EpochArray (
    eeg_id INT NOT NULL,
    epoch INT NOT NULL,
    signals SMALLINT[][] NOT NULL,
    PRIMARY KEY (eeg_id, epoch)
);
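A whole channel can then be sliced out with 2-D subscripts; note that PostgreSQL requires the explicit 3:3 range when mixed with a slice, since signals[3][1:100] would be read as signals[1:3][1:100] (key values again illustrative):

-- First 100 samples of channel 3, returned as a 1 x 100 array
SELECT signals[3:3][1:100]
FROM EpochArray
WHERE eeg_id = 1 AND epoch = 42;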

BYTEA array per epoch

CREATE TABLE EpochBytea (
    eeg_id INT NOT NULL,
    epoch INT NOT NULL,
    signals BYTEA NOT NULL,
    PRIMARY KEY (eeg_id, epoch)
);
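Here a channel has to be located by byte offset within the concatenated epoch; a sketch, assuming all channels hold 1024 two-byte samples (the fixed layout is my assumption):

-- Channel 3 (0-based) starts at byte 3 * 1024 * 2; substring on BYTEA is 1-based
SELECT substring(signals FROM 3 * 2048 + 1 FOR 2048) AS channel_3
FROM EpochBytea
WHERE eeg_id = 1 AND epoch = 42;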

I then imported a selection of EDF files into each of these relations via Java JDBC and compared the growth in database size after each upload.
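The growth is easy to track with the built-in size functions; note that large object data lives in the pg_largeobject catalog rather than in BlobFile itself:

-- Whole-database size, and per-table size including TOAST and indexes
SELECT pg_size_pretty(pg_database_size(current_database()));
SELECT pg_size_pretty(pg_total_relation_size('EpochBytea'));
SELECT pg_size_pretty(pg_total_relation_size('pg_largeobject'));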

The files were:

  • File A: 2706 epochs of 16 channels, each channel 1024 samples (16384 samples per epoch), 85 MB
  • File B: 11897 epochs of 18 channels, each channel 1024 samples (18432 samples per epoch), 418 MB
  • File C: 11746 epochs of 20 channels, each channel 64 to 1024 samples (17088 samples per epoch), 382 MB

In terms of storage cost, here's the size occupied in MB for each case:

[Chart: storage cost in MB for each representation and test file]

Relative to the original file size, Large Objects were about 30-35% larger. By contrast, storing each epoch as a single BYTEA or SMALLINT[][] value added less than 10%. Storing each channel as a separate tuple gave about a 40% increase, whether as BYTEA or SMALLINT[] (plausibly the per-tuple header overhead of having one row per channel per epoch), so not much worse than storing as a large object.

One thing I hadn't initially appreciated is that "Multidimensional arrays must have matching extents for each dimension" in PostgreSQL. This means that the SMALLINT[][] representation only works when all channels in an epoch have the same number of samples. Hence File C fails to work with the EpochArray relation.
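The failure is easy to reproduce by building a 2-D array from channels of unequal length (error text quoted from memory, so approximate):

SELECT ARRAY[ARRAY[1,2,3], ARRAY[4,5]];
-- ERROR: multidimensional arrays must have array expressions with matching dimensions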

In terms of access costs, I haven't experimented with this yet, but at least for the initial data load the fastest representations were EpochBytea and BlobFile, with EpochChannelArray the slowest, taking about three times as long as the first two.