PostgreSQL – How to Best Store Pixels in a Database

centos-7 postgresql postgresql-9.5

When loading data into a table, I get the following error:

ERROR:  row is too big: size 8680, maximum size 8160

The table has 1000+ columns in it, which appears to be the problem. The general internet advice is "refactor!" or "normalize!". For instance, this post. Unfortunately, I don't believe such advice applies to my situation.

The table is to store data collected from a device. The device produces a PNG image as part of an analysis. The PNG consists of 1024 pixels. Each pixel has an associated numeric value. Along with the pixel data are various other fields related to the analysis. Breaking the table into parts doesn't really make sense. The fields are all logically associated with the particular object being analyzed.

Postgres doesn't seem to like that each pixel has its own field. The table has fields of the form: pixel_1, pixel_2, …, pixel_1024. Note that this is fundamentally different from the usual example of phone_number_1, phone_number_2, etc. Each pixel is a unique object by virtue of its location. pixel_1 has a different position than pixel_123 and each pixel has an associated value. The common aspect between them is that they both are used to describe the same analysis object. They are the quantitative analog to the visual representation given in the PNG.

Is there a way to increase the row size?
If the table simply cannot have 1000+ columns, how could I refactor this?
Assuming the first two answers are "No.", should I just stick the 1024 columns into an XML and throw that in a text field?

I hope I have made the context clear. I have tried to boil the problem down to its essence; please let me know if any clarification is needed.

EDIT: As an experiment, I tried moving the pixels into a separate table. That seems to be the only possible way to refactor, but the 1024 columns produce the same error.

Best Answer

I would go for an array:

create table device
(
   id      integer primary key,
   pixels  integer[]
);

The drawback is that the array is read and written as a unit, because it is a single column; even updating one pixel rewrites the whole value.

Note that Postgres does not enforce array limits. Even if you declare the column as integer[1024] you can still store more or less than 1024 pixels in it. If you need to put a constraint on that, you can use a check constraint.
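As a rough sketch of such a check constraint (the table name device_analysis and the constraint name are made up for illustration, and the exact rules you want may differ), something like this should reject arrays that are not one-dimensional with exactly 1024 elements:

```sql
-- Hypothetical table; only the CHECK constraint is the point here.
CREATE TABLE device_analysis
(
   id      integer primary key,
   pixels  integer[] not null,
   -- array_ndims() rejects multi-dimensional arrays,
   -- cardinality() (PostgreSQL 9.4+) counts all elements and
   -- returns 0 for an empty array, so '{}' is rejected as well.
   constraint pixels_has_1024_values
      check (array_ndims(pixels) = 1 and cardinality(pixels) = 1024)
);

-- 1024 zeros: accepted
INSERT INTO device_analysis (id, pixels)
SELECT 1, array_agg(0) FROM generate_series(1, 1024);

-- Only three values: rejected by the check constraint
-- INSERT INTO device_analysis (id, pixels) VALUES (2, ARRAY[1, 2, 3]);

-- Array subscripts start at 1 by default, so this reads "pixel_123"
SELECT pixels[123] AS pixel_123 FROM device_analysis WHERE id = 1;
```

Note that this sketch does not stop individual elements from being NULL; that would need an additional condition.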

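An array is a variable-length value, so Postgres can TOAST it: once the row grows beyond roughly 2 kB, the array is compressed and, if necessary, moved out of line. That is why it does not run into the row size limit.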

Another option would be JSONB, as JSON offers at least some kind of data type information. I wouldn't go for XML nowadays. The JSON support is much better, and the functions to query and manipulate JSON are more flexible and powerful than the XML functions (and given the current JSON hype, there is more momentum there as well). Support for the SQL/JSON path language from the SQL:2016 standard arrived in Postgres 12.
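For comparison, here is a minimal sketch of the JSONB variant. The table name device_analysis_json is an assumption for illustration, and to_jsonb() requires Postgres 9.5 or later:

```sql
-- Hypothetical table storing the pixel values as a JSON array.
CREATE TABLE device_analysis_json
(
   id      integer primary key,
   pixels  jsonb not null
);

-- Build a 1024-element JSON array [1, 2, ..., 1024] as sample data
INSERT INTO device_analysis_json (id, pixels)
SELECT 1, to_jsonb(array_agg(n ORDER BY n))
FROM generate_series(1, 1024) AS n;

-- JSON array indexes are zero-based, so index 122 is "pixel_123".
-- ->> returns text, hence the cast back to integer.
SELECT (pixels ->> 122)::integer AS pixel_123,
       jsonb_array_length(pixels) AS pixel_count
FROM device_analysis_json
WHERE id = 1;
```

Compared with integer[], the values come back as text and need a cast, which is one reason the plain array is probably the more natural fit for purely numeric pixel data.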

Denormalized structure

You're trying to model relations as objects. This can work, but it gets clumsy fast. It looks like a pretty direct mapping of your C++ classes, so it has lots of tables for specific types of things, and isn't normalized.

PostgreSQL has table inheritance features that make this approach more manageable, if you really want to pursue it. However, table inheritance is basically incompatible with foreign keys, because a unique index cannot span all child relations of an inheritance base relation. This makes inheritance much less useful than it could otherwise be.

Instead try to model it relationally. There are two schools of thought here, bottom-up and top-down modelling. In practice you use a mix of the two most of the time. Your goal is a well normalized database structure.

First, though...

`IPAddress` and `MACAddress` are unnecessary

You're also modelling IPAddress and MACAddress as entities, which they aren't. They're just scalar values. As far as I can tell these are probably direct translations of a C++ class or C struct that stores IP addresses as four uint8_t octets, MAC addresses as six uint8_t octets, etc. This is completely unnecessary in the database and should be done away with entirely. PostgreSQL provides native data types that serve these specific needs:

inet
macaddr

... but if you have to remain portable across databases, you would be better advised to store your IP addresses as numeric with a boolean indicating IPv4 or IPv6, rather than storing them as octets. You need numeric because a bigint (signed 64-bit integer) isn't big enough for a 128-bit IPv6 address.
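A small sketch of both approaches follows. The table names host_address and host_address_portable and their columns are invented for illustration; only the inet and macaddr types and the numeric-plus-boolean idea come from the text above.

```sql
-- Native PostgreSQL types: one scalar column each, no extra entities.
CREATE TABLE host_address (
    host_name text primary key,
    ip_addr   inet not null,
    mac_addr  macaddr not null
);

INSERT INTO host_address (host_name, ip_addr, mac_addr)
VALUES ('gw-v4', '192.168.0.1', '08:00:2b:01:02:03'),
       ('gw-v6', '2001:db8::1', '08:00:2b:01:02:04');

-- family() reports 4 or 6; << tests "is contained within this network"
SELECT host_name, family(ip_addr) AS ip_version
FROM host_address
WHERE ip_addr << inet '192.168.0.0/24';

-- Portable variant: the address as an unsigned integer value.
-- numeric(39,0) is wide enough for 2^128 - 1, which has 39 digits.
CREATE TABLE host_address_portable (
    host_name text primary key,
    ip_value  numeric(39,0) not null check (ip_value >= 0),
    is_ipv6   boolean not null
);
```

With the native types you also get validation and network operators for free, which the portable variant would have to reimplement in the application.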

BTW, on a network a single MAC address may have multiple IP addresses. While no IP address within a given connected subnet may have multiple MAC addresses, the same IP address may appear in separate networks with different MAC addresses, so I wouldn't try to model this at the schema level. If I did, I'd create a NetworkIP relation with a primary key of (network_id integer REFERENCES Network, ipaddr) to allow the same IP to exist only on disconnected networks. But then you have to consider multilevel NAT and it gets horrifying. Don't go there. I'd leave them as scalars.

Relational model

Now that we've done away with MACAddress and IPAddress, let's look at structure.

You should flatten everything, but not like this:

It has 48 ports of 1 gig ethernet, 48 ports of 1 gig SFP and 48 ports of 10 gig SFP Plus. All in all, there are more than 1000 columns!

Rather than "wide", go "tall". Don't fear joins; joins are your friend.

You have network devices, of various specific kinds, where each kind has various additional characteristics. Each device may have one or more IP addresses and MAC addresses assigned to it (e.g. switch management addresses). Additionally, each device may have zero or more ports, and each port has one MAC address and zero or more IP addresses. To simplify things we define management interfaces as virtual ports.

So let's model that core structure.

CREATE TABLE device (
    id serial primary key,
    device_type text not null,
    device_name text,
    description text
);

-- A port is a logical port, which may correspond to a physical
-- plug, or a virtual port like an internal management interface,
-- IPMI controller, etc.
CREATE TABLE port (
    device_id integer not null references device(id),
    port_name text not null,
    port_mac macaddr not null,
    port_is_virtual boolean not null default 'f',
    PRIMARY KEY (device_id, port_name),
    UNIQUE(device_id, port_mac)
);

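-- Now the IP address(es) for a port. The port table has no "id"
-- column, so reference its composite primary key instead.
CREATE TABLE port_ip (
    device_id integer not null,
    port_name text not null,
    port_ipaddr inet not null,
    PRIMARY KEY (device_id, port_name, port_ipaddr),
    FOREIGN KEY (device_id, port_name)
      REFERENCES port(device_id, port_name)
);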

-- A socket is a physical plug, which generally has a logical
-- port associated with it.
CREATE TABLE socket (
  device_id integer references device(id),
  socket_index integer not null,
  socket_name text,
  -- This might be a good use for an enum type, but for now
  -- just use a check constraint
  socket_type text not null check (socket_type IN ('rj45', 'sfp', 'sfpplus')),
  PRIMARY KEY(device_id, socket_index)
);

-- A port may have a socket or not, and a socket may have a port
-- or not, so model via a join table. If you want to prevent
-- a socket having multiple ports or a port having multiple sockets
-- you could use additional unique constraints.
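CREATE TABLE port_sockets (
   device_id integer not null references device(id),
   port_name text not null,
   socket_index integer not null,
   PRIMARY KEY (device_id, port_name, socket_index),
   -- Foreign keys must target unique column sets, so reference the
   -- composite primary keys of port and socket.
   FOREIGN KEY (device_id, port_name)
     REFERENCES port(device_id, port_name),
   FOREIGN KEY (device_id, socket_index)
     REFERENCES socket(device_id, socket_index)
);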

See how it's "vertical" - we have a port table with entries for each port for each host, etc.
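To make the "tall" layout concrete, here is a sketch that fills in the 144 sockets from the quoted switch description as rows and then summarises them with a join. The device name 'core-sw-1' and the numbering of the sockets are assumptions for illustration; the tables are the device and socket tables defined above.

```sql
-- One device row instead of one very wide row
INSERT INTO device (device_type, device_name)
VALUES ('switch', 'core-sw-1');

-- 144 socket rows: 1-48 RJ45, 49-96 SFP, 97-144 SFP+
INSERT INTO socket (device_id, socket_index, socket_type)
SELECT d.id,
       n,
       CASE WHEN n <= 48 THEN 'rj45'
            WHEN n <= 96 THEN 'sfp'
            ELSE 'sfpplus'
       END
FROM device d
CROSS JOIN generate_series(1, 144) AS n
WHERE d.device_name = 'core-sw-1';

-- Sockets per device and type, recovered with a join
SELECT d.device_name, s.socket_type, count(*) AS sockets
FROM device d
JOIN socket s ON s.device_id = d.id
GROUP BY d.device_name, s.socket_type
ORDER BY d.device_name, s.socket_type;
```

Adding a switch model with a different socket layout then means inserting different rows, not altering the table.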

So now you want to model switches. These are network devices. They have a certain number of ports based on the model.

You can keep track of the number of ports directly in the entity, or you can just count the number of ports for that switch when you want to know. While keeping track of the port count in the switch entity is a denormalization that duplicates data, in this case it's probably worth it. So I'd do something like:

CREATE TABLE switch (
  switch_id serial primary key,
  device_id integer references device(id),
  nports integer not null,
  model text,
  management_port_name text,

  constraint switch_management_port
   foreign key (device_id, management_port_name)
   references port(device_id, port_name)

);

etc.

The switch is a network device, and ports are associated with it via the device table.
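Since nports duplicates information that is also implied by the port rows, it can drift out of sync. A possible consistency query, assuming nports is meant to count only non-virtual ports (that interpretation is mine, not something stated above), might look like this:

```sql
-- List switches whose stored port count disagrees with the number
-- of non-virtual port rows recorded for the same device.
SELECT sw.switch_id,
       sw.nports,
       count(p.port_name) AS ports_recorded
FROM switch sw
LEFT JOIN port p
       ON p.device_id = sw.device_id
      AND NOT p.port_is_virtual
GROUP BY sw.switch_id, sw.nports
HAVING sw.nports <> count(p.port_name);
```

The LEFT JOIN keeps switches that have no port rows at all, so they show up with ports_recorded = 0 instead of silently disappearing from the result.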

All this is a bit hand-wavey since I don't know exactly what you need to model, but should help you start thinking about it in a way that'll get you moving again.

Learn to love joins.

Modelling connections

To model connections between devices you might want to use an adjacency list to model it as an undirected cyclic graph. That sounds hairy, but it's really not. It's just a simple table saying "this one is connected to this one" without bothering to say which is "from" and which is "to", treating them as symmetrical.

There's a lot of established literature on querying adjacency lists from SQL. You'll probably want to get friendly with recursive common table expressions (WITH RECURSIVE) if you want to do it efficiently, but for simple cases you can just do loops in C, or fetch relevant subsets of the adjacency list and search them with boost::graph.

e.g.

CREATE TABLE socket_connection (
    first_device_id integer references device(id),
    first_socket_index integer,
    second_device_id integer references device(id),
    second_socket_index integer,

    PRIMARY KEY (first_device_id, first_socket_index,
                 second_device_id, second_socket_index),

    UNIQUE(first_device_id, first_socket_index),

    UNIQUE(second_device_id, second_socket_index),

    FOREIGN KEY (first_device_id, first_socket_index)
      REFERENCES socket(device_id, socket_index),

    FOREIGN KEY (second_device_id, second_socket_index)
      REFERENCES socket(device_id, socket_index),

    -- A simple way to prevent both (1, 1, 2, 1) and (2, 1, 1, 1)
    -- being stored as separate connections:

    CONSTRAINT first_device_id_is_always_lowest
      CHECK (first_device_id < second_device_id)
);

... which looks hideous, but is actually just a mapping between two sockets plus a lot of sanity checking.
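As a sketch of the recursive CTE approach mentioned earlier, the following query lists every device reachable from a starting device through any chain of connections. The starting device id 1 is an arbitrary assumption; the table is the socket_connection table above.

```sql
WITH RECURSIVE
-- Each connection is stored once, so expose it in both directions
-- to treat the graph as undirected.
edge AS (
    SELECT first_device_id AS from_device, second_device_id AS to_device
    FROM socket_connection
    UNION ALL
    SELECT second_device_id, first_device_id
    FROM socket_connection
),
reachable(device_id) AS (
    SELECT 1                      -- assumed starting device id
    UNION                         -- UNION (not UNION ALL) drops rows
                                  -- already seen, so cycles terminate
    SELECT e.to_device
    FROM reachable r
    JOIN edge e ON e.from_device = r.device_id
)
SELECT device_id
FROM reachable
ORDER BY device_id;
```

The result includes the starting device itself. Using UNION ALL in the recursive part would loop forever on a cyclic network, which is why the deduplicating UNION matters here.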

A note on composite keys

I've used composite keys like (device_id, socket_index) a lot above.

Some people like them, some don't. There are advantages and disadvantages to their use. If you want to use generated surrogate keys everywhere, that's fine too.
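For comparison, a surrogate-key version of the socket table might look roughly like this. The table name socket_with_surrogate_key is made up for illustration; the UNIQUE constraint preserves the rule that the composite primary key enforced before.

```sql
CREATE TABLE socket_with_surrogate_key (
  -- Generated surrogate key: other tables reference this single column
  socket_id serial primary key,
  device_id integer not null references device(id),
  socket_index integer not null,
  socket_name text,
  socket_type text not null check (socket_type IN ('rj45', 'sfp', 'sfpplus')),
  -- Still required, otherwise the same socket could be inserted twice
  UNIQUE (device_id, socket_index)
);
```

Referencing tables become narrower (one integer column per socket), at the cost of an extra join whenever you need to know which device a referenced socket belongs to.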

Reinventing the wheel

You might think "hey, has this been done before?".

I'm quite sure it has. Whether the existing models match your needs is less certain, but I suggest looking over some of them before you go and invent your own from scratch.

PostgreSQL Large Objects and Multiple Users

One option is to run SET ROLE right after your application opens its connection, switching to databaserole:

SET ROLE databaserole;

This way, any object (including large objects) created within this session will be owned by databaserole instead of token-XXX.

Another option is to use SET LOCAL, so that the role change lasts only for the transaction that creates the large object:

BEGIN;
SET LOCAL ROLE databaserole;
-- create and insert the large object
COMMIT;
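To check that this behaves as intended, a sketch along the following lines could be run as the dynamically created user. It assumes that user is allowed to SET ROLE to databaserole, and it uses lo_create(0) only to get a throwaway large object with a system-assigned OID:

```sql
BEGIN;
SET LOCAL ROLE databaserole;
-- current_user should now be databaserole, session_user the login role
SELECT current_user, session_user;
-- Create an empty large object; it is owned by the current role
SELECT lo_create(0) AS loid;
COMMIT;

-- SET LOCAL ends with the transaction, so this is the login role again
SELECT current_user;

-- Inspect who owns each large object
SELECT oid AS loid, pg_get_userbyid(lomowner) AS owner
FROM pg_largeobject_metadata
ORDER BY oid;
```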

Alternatively, you could just set the role directly as the user setting:

ALTER ROLE "{{name}}" SET role TO 'databaserole';

I rather dislike that option: it can be unclear to others how it works, and if you grant other roles to the user, it won't pick up their privileges while this setting is in effect (although that doesn't seem to be a problem for you, since a dynamically created user should have a single role that it inherits from directly).
