Problem 1:
I have one file with 567 rows and 16,382 columns, most of which are floating point numbers. I have another file with 117,493 rows but only 3 columns. The contents deal with biology and genetics.
I have no idea what the numbers in the files mean. This is a college project where I have to be able to query different things given the files, without knowing what the contents are. I am assuming we are not supposed to normalize the data, because that would require understanding the contents. We are graded on performance and on our reasons for choosing particular databases.
File 1: (567 rows, 16,382 columns)
People have been surprised at the number of columns in this file. The columns are mostly genetic information with a bunch of numbers and cannot be normalized. I initially thought PostgreSQL would be good for File 1, but I read that it is a row-oriented database, so it would be horrible. I read about Cassandra being good for column-oriented look-ups, but the problem is that most of the columns and rows contain data, and the data is mostly structured, except for a few areas with no values. Would it still be a better idea to use Cassandra?
Operations for File 1:
- simple look-up
- find mean
- find standard deviation
I have not been able to find solid information on the relative performance of SQL and NoSQL for aggregate functions such as calculating mean and standard deviation, so this is another factor in selecting my database. I would assume Cassandra is faster than PostgreSQL here, since Cassandra is column-oriented.
File 2: (117,493 rows, 3 columns)
Every single row and column contains data. I am guessing that PostgreSQL is a good fit here, since there are only 3 columns but 117,493 rows, PostgreSQL is row-oriented, and there is no missing data. Is there a better NoSQL alternative in this case? Would a key-value NoSQL store be better, since this will mostly be used for retrieval?
Operations for File 2:
- simple look-up
Problem 2:
Most of the large files I have are related by some kind of ID, such as File 1 and File 2. If I use PostgreSQL for File 2 and Cassandra for File 1, is there a considerable performance loss? Usually you would store related tables in an RDBMS, but does the number of columns in File 1 mean it is better to use a hybrid approach?
Best Answer
PostgreSQL
This is generally true, and you should start by looking at how you can normalize. Clearly you're averse to the idea, and you can't always normalize. For instance, a raster is often a collection of readings from a scientific instrument, and you can't really normalize that: the readings are taken at the same time, represent the same sample, and frequently have thousands of data points. (Though now you can store them in PostGIS.)
Solution for Problem 1
PostgreSQL does not support 16,382 columns on a table (nor anything beyond 1,600, for that matter), but it does support SQL arrays. You can probably make use of those.
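For example (a sketch; the table and column names here are assumptions, and `float8[]` matches the mostly-floating-point data described in the question):

```sql
-- One row per sample; the 16,382 measurements live in a single array column.
CREATE TABLE file1 (
  sample_id    serial PRIMARY KEY,
  measurements float8[] NOT NULL
);

-- Stand-in data: 567 rows of 16,382 random floats each.
INSERT INTO file1 (measurements)
SELECT array_agg(random())
FROM generate_series(1, 567)    AS row,
     generate_series(1, 16382)  AS col
GROUP BY row;
```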
Now you have a table that meets the first criteria but isn't insane.
Finding the average is simple:
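Assuming the array layout above (table and column names are assumptions), unnest the array and aggregate:

```sql
-- Mean of every value in the whole table
SELECT avg(x)
FROM file1, unnest(measurements) AS x;

-- Or per row:
SELECT sample_id, avg(x) AS row_mean
FROM file1, unnest(measurements) AS x
GROUP BY sample_id;
```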
And for that whole operation:

    Execution time: 3865.308 ms

Can't complain about that. Feel free to use a `MATERIALIZED VIEW` if you want to cache that average. For standard deviation, just use the appropriate aggregate function, `stddev()`, in place of `avg()`; execution times are about the same.

I'm not sure how you're querying this, but even without an index (the worst-case situation) it's pretty simple and fast. Considering it's a seq scan, scanning through an array with 16,000 items isn't slow for a modern CPU, and on PostgreSQL 9.6 this seq scan should even run in parallel (AFAIK).
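A cached version of those aggregates could look like this (a sketch, reusing the assumed `file1` layout; `file1_stats` is an invented name):

```sql
-- Cache per-row mean and standard deviation.
CREATE MATERIALIZED VIEW file1_stats AS
SELECT sample_id,
       avg(x)    AS mean,
       stddev(x) AS std_dev
FROM file1, unnest(measurements) AS x
GROUP BY sample_id;

-- Re-run after the base table changes:
REFRESH MATERIALIZED VIEW file1_stats;
```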
Can you use an index? Sure, from the `intarray` module. Let's give it a shot...
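Note that `intarray` only indexes `integer[]` columns, so this sketch assumes an integer-array version of the table rather than the `float8[]` one; table and column names are assumptions:

```sql
CREATE EXTENSION IF NOT EXISTS intarray;

CREATE TABLE file1_int (
  sample_id    serial PRIMARY KEY,
  measurements int[] NOT NULL
);

-- GIN index using the intarray operator class
CREATE INDEX ON file1_int USING gin (measurements gin__int_ops);

-- Containment query that can use the index: rows whose array contains 5
SELECT sample_id FROM file1_int WHERE measurements @> '{5}';
```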
Not sure what more you want than an index scan on that. It would probably be more useful if not every row had an element of `5`, but, you know, it's only sample data. Anyway, have at it.
Solution for Problem 2
The other file isn't a problem at all. There is a bit of overhead because Pg rows are fat, but you're not talking about many rows, and ultimately it shouldn't matter: a table meeting that criteria is only 5,088 kB on my system. Hardly worth concerning yourself with.
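A sketch of what that might look like (column names and types are assumptions, since File 2's actual contents are unknown; the `id` column stands in for the shared ID relating the two files):

```sql
CREATE TABLE file2 (
  id   int PRIMARY KEY,   -- assumed shared ID between File 1 and File 2
  col2 text NOT NULL,
  col3 text NOT NULL
);

-- Stand-in data: 117,493 rows
INSERT INTO file2
SELECT n, 'value_' || n, 'value_' || n
FROM generate_series(1, 117493) AS n;

-- Check the on-disk size, including indexes
SELECT pg_size_pretty(pg_total_relation_size('file2'));
```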