Postgresql – Optimal database for large number of columns or rows

cassandra, database-recommendation, nosql, postgresql

Problem 1:

I have one file with 567 rows and 16,382 columns, most of which are floating point numbers. I have another file with 117,493 rows but only 3 columns. The contents deal with biology and genetics.

I have no idea what the numbers in the files mean. This is a college project where I have to be able to query different things from the files without knowing what the contents are. I am assuming we are not supposed to normalize the data, because that would require an understanding of the contents. We are graded on performance and on the reasoning for choosing each database.

File 1: (567 rows, 16,382 columns)

People have been surprised at the number of columns in this file. The columns are mostly genetic information, a bunch of numbers, and cannot be normalized. I initially thought PostgreSQL would be good for File 1, but I read that it is a row-oriented database, so it would be a poor fit. I read that Cassandra is good for column-oriented look-ups, but the problem is that most of the columns and rows contain data and the data is mostly structured, except for a few areas with no values. Would it still be a better idea to use Cassandra?

Operations for File 1:

  • simple look-up
  • find mean
  • find standard deviation

I have not been able to confirm the relative performance of SQL and NoSQL databases on aggregate functions such as mean and standard deviation, so this is another factor in selecting my database. I would assume Cassandra is faster than PostgreSQL here since Cassandra is column-oriented.

File 2: (117,493 rows, 3 columns)

Every single row and column contains data. I am guessing that PostgreSQL is good here since there are only 3 columns but 117,493 rows, PostgreSQL is row-oriented, and there is no missing data. Is there a better NoSQL alternative in this case? Would a key-value NoSQL store be better since this will mostly be used for retrieval?

Operations for File 2:

  • simple look-up

Problem 2:

Most of the large files I have are related by some kind of ID, such as File 1 and File 2. If I use PostgreSQL for File 2 and Cassandra for File 1, is there a considerable performance loss? Usually you would store related tables in an RDBMS, but does the number of columns in File 1 mean it is better to use a hybrid approach?

Best Answer

PostgreSQL

@Dobob everything can be normalized :) If it's not very secret, what sort of data is it? (Asking as a geneticist-turned-DBA.)

This is generally true, and you should start by looking at how you can normalize. Clearly you're averse to the idea, and you can't always normalize. For instance, a raster is often a collection of readings from a scientific instrument, and you can't really normalize that: the readings are taken at the same time, represent the same sample, and frequently have thousands of data points. (Though now you can store them in PostGIS.)

Solution for Problem 1

PostgreSQL does not support 16,383 columns on a table (or anything beyond 1,600, for that matter), but it does support SQL arrays. You can probably make use of those:

CREATE TABLE foo AS
  SELECT
    r AS id,
    array_agg(c) AS genetic_stuff
  FROM generate_series(1,567)
    AS r
  CROSS JOIN LATERAL generate_series(1,16382)
    AS c
  GROUP BY r;

Now you have a table that meets the first criterion but isn't insane.

          Table "public.foo"
    Column     |   Type    | Modifiers 
---------------+-----------+-----------
 id            | integer   | 
 genetic_stuff | integer[] | 
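
When the real file is loaded, each of the 567 rows becomes one array. A minimal sketch, assuming a hypothetical table named file1 and made-up values (the real values are floats, so float8[] is a more natural element type than the integer[] used for the synthetic data):

-- Hypothetical table for the real data; float8[] because the file holds floats.
CREATE TABLE file1 (
  id            integer PRIMARY KEY,
  genetic_stuff float8[]
);

-- One file row = one array of 16,382 values (only 3 shown here).
INSERT INTO file1 (id, genetic_stuff)
VALUES (1, ARRAY[0.12, 3.45, 6.78]);

The examples that follow stick with the synthetic foo table.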

Finding the average is simple:

SELECT id, avg(x)
FROM foo
CROSS JOIN LATERAL unnest(genetic_stuff)
  AS t(x)
GROUP BY id;

And that whole operation reports Execution time: 3865.308 ms! Can't complain about that. Feel free to use a MATERIALIZED VIEW if you want to cache that average.
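
A minimal sketch of that caching, assuming a hypothetical view name foo_avg:

-- Cache the per-row mean; refresh whenever foo changes.
CREATE MATERIALIZED VIEW foo_avg AS
  SELECT id, avg(x) AS mean
  FROM foo
  CROSS JOIN LATERAL unnest(genetic_stuff)
    AS t(x)
  GROUP BY id;

REFRESH MATERIALIZED VIEW foo_avg;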

For standard deviation, just use the appropriate aggregate function in place of avg(). Execution times are about the same.
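
For example, a sketch using the built-in stddev_samp() (stddev_pop() exists too if you want the population variant):

SELECT id, stddev_samp(x) AS sd
FROM foo
CROSS JOIN LATERAL unnest(genetic_stuff)
  AS t(x)
GROUP BY id;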

I'm not sure how you'll be querying this, but even without an index (the worst-case situation) a look-up is pretty simple and fast, considering it's a seq scan. Scanning through an array with 16,000 items isn't slow for a modern CPU either, and on PostgreSQL 9.6 this seq scan should even run in parallel (AFAIK).

SELECT * FROM foo
WHERE genetic_stuff @> ARRAY[5];

Can you use an index? Sure. From the intarray module,

Two GiST index operator classes are provided: gist__int_ops (used by default) is suitable for small- to medium-size data sets, while gist__intbig_ops uses a larger signature and is more suitable for indexing large data sets (i.e., columns containing a large number of distinct array values). The implementation uses an RD-tree data structure with built-in lossy compression.

Let's give it a shot...

-- gist__intbig_ops is provided by the intarray extension.
CREATE EXTENSION IF NOT EXISTS intarray;

CREATE INDEX ON foo
  USING gist(genetic_stuff gist__intbig_ops);
VACUUM FULL foo;
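
The plan below is presumably from re-running the containment query with EXPLAIN ANALYZE, along the lines of:

EXPLAIN ANALYZE
SELECT * FROM foo
WHERE genetic_stuff @> ARRAY[5];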

                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------
 Index Scan using foo_genetic_stuff_idx on foo  (cost=0.14..8.16 rows=1 width=22) (actual time=0.119..47.846 rows=567 loops=1)
   Index Cond: (genetic_stuff @> '{5}'::integer[])
 Planning time: 0.072 ms
 Execution time: 47.948 ms
(4 rows)

Not sure what more you want than an index scan on that. It would probably be more useful if not every row had an element of 5, but you know, it's only sample data.
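
If "simple look-up" also means fetching one original column for a given row, array subscripting covers that; a minimal sketch (the id value and position are made up):

-- Original column 42 of row 123 is just element 42 of that row's array.
SELECT id, genetic_stuff[42] AS col_42
FROM foo
WHERE id = 123;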

Anyway, have at it.

Solution for Problem 2

Shy of that, the other file isn't a problem at all. There is a bit of overhead because Pg rows are fat, but you're not talking about many rows, and ultimately it shouldn't matter: the table below, which meets that criterion, is only 5088 kB on my system. Hardly worth concerning yourself with.

CREATE TABLE baz
AS
  SELECT
    x AS id,
    x*2 AS x2,
    x*3 AS x3,
    x*4 AS x4
  FROM generate_series(1,117493)
    AS t(x);
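
As a side note, the 5088 kB figure is the kind of thing you can check yourself with pg_size_pretty(), and a plain primary key on the ID column is all File 2 needs for its key-value-style look-ups. A sketch:

-- Index the ID for fast point look-ups.
ALTER TABLE baz ADD PRIMARY KEY (id);

-- Check the on-disk size (table plus indexes).
SELECT pg_size_pretty(pg_total_relation_size('baz'));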