There are a couple of concepts which need to be distinguished. One is about structure and the other about schema.
Structured data is data where the application knows in advance the meaning of each byte it receives. A good example is measurements from a sensor. In contrast, a Twitter stream is unstructured. Schema is about how much of that structure is communicated to the DBMS and how much the DBMS is asked to enforce; it controls how much the DBMS parses the data it stores. A schema-required DBMS such as SQL Server can store unparsed data (varbinary), optionally-parsed data (xml), or fully parsed data (columns).
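As a rough illustration (hypothetical table and column names, just to show the three levels in SQL Server):

CREATE TABLE dbo.SensorReading (
    RawPayload varbinary(max), -- unparsed: the DBMS just stores the bytes
    Detail     xml,            -- optionally parsed: shredded only when queried
    SensorId   int,            -- fully parsed: typed, validated, constrainable
    ReadingAt  datetime2
);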
NoSQL DBMSs lie on a spectrum from no parsing (key-value stores) upwards. Cassandra offers relatively rich functionality in this respect. Where they differ markedly from relational stores is in the uniformity of the data. Once a relational table is defined, only data which matches that definition may be held there. In Cassandra, however, even if columns and families are defined, there is no requirement for any two rows in the same table to look anything like each other. It falls to the application designer to decide how much goes in a single row (also referred to as a document) and what is held separately, linked by pointers. In effect: how much denormalisation do you want?
The advantage is that you can retrieve a full set of data with a single sequential read, which is fast. One downside is that you, the application programmer, are now solely responsible for all data integrity and backward-compatibility concerns, forever, for every bit of code that ever touches this data store. That can be difficult to get right. Also, you are locked into one point of view on the data: if you key your rows by order number, how do you report on the sales of one particular product, or region, or customer?
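For instance (a hypothetical CQL sketch, not from the original question), an orders table keyed by order number makes the one-read case trivial but every other view of the data awkward:

CREATE TABLE orders_by_number (
    order_number bigint PRIMARY KEY,
    customer     text,
    region       text,
    products     list<text>    -- the whole order denormalised into one row
);

-- One sequential read per order:
SELECT * FROM orders_by_number WHERE order_number = 12345;

-- Reporting by product, region or customer now needs a full scan, a secondary
-- index, or a second denormalised table such as orders_by_product.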
PostgreSQL
@Dobob everything can be normalized :) If it's not very secret, what sort of data is it? (Asking as a geneticist-turned-DBA.)
This is generally true. And you should start by looking at how you can normalize. Clearly you're averse to the idea, and you can't always normalize. For instance, a raster is often a collection of readings from a scientific instrument, and you can't really normalize that. They're taken at the same time, represent the same sample, and frequently have 1,000s of data points. (Though you can now store them in PostGIS.)
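Where normalisation does apply, the usual starting point is something long and narrow rather than wide (hypothetical names, just to show the shape):

CREATE TABLE reading (
    sample_id integer NOT NULL,  -- one row per sample/marker pair
    marker_id integer NOT NULL,  -- instead of one column per marker
    value     integer NOT NULL,
    PRIMARY KEY (sample_id, marker_id)
);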
Solution for Problem 1
PostgreSQL does not support 16,383 columns on a table (nor anything over 1,600, for that matter), but it does support SQL arrays. You can probably make use of that:
CREATE TABLE foo AS
  SELECT r AS id,
         array_agg(c) AS genetic_stuff
  FROM generate_series(1,567) AS r
  CROSS JOIN LATERAL generate_series(1,16382) AS c
  GROUP BY r;
Now you have a table that meets the first criterion but isn't insane.
Table "public.foo"
Column | Type | Modifiers
---------------+-----------+-----------
id | integer |
genetic_stuff | integer[] |
Finding the average is simple:
SELECT id, avg(x)
FROM foo
CROSS JOIN LATERAL unnest(genetic_stuff) AS t(x)
GROUP BY id;
And that whole operation? Execution time: 3865.308 ms. Can't complain about that. Feel free to use a MATERIALIZED VIEW if you want to cache that average.
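A minimal sketch of that cache (hypothetical view name):

CREATE MATERIALIZED VIEW foo_avg AS
SELECT id, avg(x) AS avg_genetic_stuff
FROM foo
CROSS JOIN LATERAL unnest(genetic_stuff) AS t(x)
GROUP BY id;

REFRESH MATERIALIZED VIEW foo_avg;  -- rerun whenever foo changes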
For standard deviation, just use the appropriate aggregate function in place of avg(). Execution times are about the same.
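For example, stddev() (or stddev_samp()/stddev_pop(), depending on which you want) is the obvious drop-in:

SELECT id, stddev(x)
FROM foo
CROSS JOIN LATERAL unnest(genetic_stuff) AS t(x)
GROUP BY id;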
I'm not sure how you're querying this, but even without an index (the worst-case situation) it's pretty simple and fast. Considering it's a seq scan, scanning through an array with 16,000 items isn't slow for a modern CPU either, and on PostgreSQL 9.6 this seq scan should even run in parallel (afaik).
SELECT * FROM foo
WHERE genetic_stuff @> ARRAY[5];
Can you use an index? Sure. From the intarray module:
Two GiST index operator classes are provided: gist__int_ops (used by default) is suitable for small- to medium-size data sets, while gist__intbig_ops uses a larger signature and is more suitable for indexing large data sets (i.e., columns containing a large number of distinct array values). The implementation uses an RD-tree data structure with built-in lossy compression.
Let's give it a shot...
CREATE EXTENSION intarray;

CREATE INDEX ON foo
  USING gist(genetic_stuff gist__intbig_ops);
VACUUM FULL foo;
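The plan below isn't shown with its statement, but it matches what you'd expect from running EXPLAIN ANALYZE over the same containment query (a reconstruction, not the original output's header):

EXPLAIN ANALYZE
SELECT * FROM foo
WHERE genetic_stuff @> ARRAY[5];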
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Index Scan using foo_genetic_stuff_idx on foo (cost=0.14..8.16 rows=1 width=22) (actual time=0.119..47.846 rows=567 loops=1)
Index Cond: (genetic_stuff @> '{5}'::integer[])
Planning time: 0.072 ms
Execution time: 47.948 ms
(4 rows)
Not sure what more you want than an index scan on that. It would probably be more useful if not every row had an element of 5, but you know, it's only sample data.
Anyway have at it.
Solution for Problem 2
Shy of that, the other thing isn't a problem at all. There is a bit of overhead because Pg rows are fat, but you're not talking many rows, and ultimately it shouldn't matter: the table below, which meets that criterion, is only 5088 kB on my system. Hardly worth concerning yourself with.
CREATE TABLE baz AS
  SELECT x   AS id,
         x*2 AS x2,
         x*3 AS x3,
         x*4 AS x4
  FROM generate_series(1,117493) AS t(x);
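If you want to check that figure on your own system, pg_total_relation_size() will tell you:

SELECT pg_size_pretty(pg_total_relation_size('baz'));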
Best Answer
Ok, something must have gone wrong in your upgrade, because that's not right. When I check system_schema on my 3.4 install, I see this:
There are definitely more than two tables in that keyspace.
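If you want to verify that on your own cluster, something like this from cqlsh (assuming a 3.x node) lists them:

SELECT table_name FROM system_schema.tables
WHERE keyspace_name = 'system_schema';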
The new way to do this is to query system_schema.keyspaces:
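SELECT * FROM system_schema.keyspaces;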
The main difference between system.schema_keyspaces and system_schema.keyspaces is that system_schema.keyspaces only has 3 columns instead of four (strategy_class and strategy_options were combined into replication).
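So a 3.x keyspace row carries the old strategy information inside a single map column, roughly like this (illustrative values, not your cluster's output):

SELECT keyspace_name, durable_writes, replication
FROM system_schema.keyspaces;

-- replication is a frozen<map<text, text>> holding what strategy_class and
-- strategy_options used to, e.g.
-- {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'}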