I thought about another ColumnFamily "ArticlesByDateAndCountry" with dynamic columns.
You're on the right track with this thought, but I'd stay away from dynamic columns. Currently, there isn't a way to manage column families with dynamic columns in CQL3, so the only real solution is to create them in the cassandra-cli, which is being deprecated. Sticking with CQL tables gives you a much easier path to data access, and that in itself (IMO) makes it worth it. Besides, all the latest drivers work with CQL, so you're really backing yourself into a corner by choosing a path that CQL can't manage.
While it may not be readily apparent, there is a way to solve your current problem to adequately serve your queries. I would build a (CQL) table, like this:
CREATE TABLE ArticlesByDateAndCountry (
    countrycode text,
    articledate timestamp,
    field1 text,
    field2 text,
    PRIMARY KEY ((countrycode), articledate))
WITH CLUSTERING ORDER BY (articledate DESC);
Note: I have created two sample payload fields, field1 and field2. I'm sure your payload fields will vary. Also, I have opted to use a timestamp instead of a timeuuid, as it makes the example easier.
Essentially, this will partition your data by countrycode, and within each countrycode, your data will be sorted by articledate. A query for one day's articles in a country then looks like this:
SELECT articledate, field1, field2
FROM ArticlesByDateAndCountry
WHERE countrycode='US'
AND articledate >= '2015-01-23 00:00:00' AND articledate < '2015-01-24 00:00:00';
You should read Patrick McFadin's article Getting started with Cassandra time series data modeling. It has several examples that are quite similar to what you are doing here. While I have demonstrated this with the timestamp type, you could very easily make this work with time UUIDs instead. Here is a link to DataStax's documentation on Cassandra's timeuuid functions that I am sure you will find useful.
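As a side note on timeuuids: a version 1 UUID embeds a 60-bit timestamp (the count of 100-nanosecond intervals since the Gregorian epoch, 1582-10-15), which is what lets Cassandra cluster and range-filter on a timeuuid column. A quick Python sketch, outside of Cassandra and purely to illustrate the encoding (the helper function is my own, not part of any driver):

```python
import uuid
from datetime import datetime, timedelta

# Generate two time-based (version 1) UUIDs in succession.
u1 = uuid.uuid1()
u2 = uuid.uuid1()

# The .time field is the number of 100-ns intervals since the
# Gregorian epoch (1582-10-15). A UUID generated later carries a
# larger timestamp, which is why timeuuids sort chronologically
# when used as a clustering column.
GREGORIAN_EPOCH = datetime(1582, 10, 15)

def uuid1_to_datetime(u):
    """Recover the wall-clock time (UTC) embedded in a v1 UUID."""
    return GREGORIAN_EPOCH + timedelta(microseconds=u.time // 10)

print(uuid1_to_datetime(u1))  # roughly "now" in UTC
```

This is also why CQL's timeuuid functions (minTimeuuid, maxTimeuuid, dateOf) can convert between timestamps and timeuuids for range queries.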
You are right in thinking that items of a list should be stored as individual values of a column, i.e. in separate rows of that column, rather than as a single value (CSV string or anything like that) – at least if you expect to have queries against individual items of the list. Relational databases are designed to work that way, and storing multiple values of a column as a list (single value) is a no-no in such cases.
Regarding your data sample of fruits and vegetables, I am assuming that you are not thinking of splitting two lists of values exactly like that, storing each list's elements alongside each other on the same row without there actually being a relationship between them, just for the sake of normalisation of the way the vegetables and fruits are stored. That would be wrong.
I mean, if currently you have a row with a list of fruits and a list of vegetables such that each fruit and each vegetable are related to this particular row, i.e. like this:
FoodsID Fruits Vegetables other columns
------- -------------------------- ------------- ----- -------
100 Apple,Avocado,Mango,Orange Corn,Tomatoes ... ...
and you want to expand each list into a row set while keeping the relationship of each fruit and each vegetable to the Foods row, you do not just add more rows and put the first item of one list together with the first item of the other list on the same row, then the second item of each list on another row, and so on, like this:
FoodsID Fruits Vegetables ...
------- ------- ---------- -----
100 Apple Corn ...
100 Avocado Tomatoes ...
100 Mango ...
100 Orange ...
Again, that would be wrong. The fruit Apple may well be related to food item 100, as may the vegetable Corn, but if these two items are not related to each other, there is no reason to store them on the same row.
Commonly, when you want to store a relationship between a subset and an item, you use a separate table. As you have two subsets of different kinds that are related to the same item, you just use two tables. So it would be:
table Foods:
FoodsID other columns
------- ----- -------
100 ... ...
table Fruits:
FoodsID Fruit
------- -------
100 Apple
100 Avocado
100 Mango
100 Orange
table Vegetables:
FoodsID Vegetable
------- ---------
100 Corn
100 Tomatoes
where FoodsID in table Foods would be the primary key of that table, and the same-named column in tables Fruits and Vegetables would serve as a reference (foreign key) to the corresponding row in Foods.
This way, when querying against individual items of either list as belonging to that particular row in Foods, you would probably use a join. Your typical filter conditions would be as simple as Vegetable = 'some vegetable' or Fruit IN ('some fruit', 'some other fruit') – not FindPos(Vegetable, 'some vegetable') > 0 or anything of the kind. The conditions would be simpler because each column would now contain one value per row. That also lends itself well to further performance optimisation: when you query often against a column value (without applying a function to it first, that is), you can make such queries faster by adding an index on the column – something that would be pointless for lists stored as a single value.
More can be said on this topic, which would, however, be beyond the scope of your question. I suggest you look up the terms I highlighted (normalisation, primary key, foreign key, join, index) for more information.
In SELECT statements, the ordering of the returned rows is not guaranteed if the ORDER BY clause is not specified. This is true for all tables and for both simple and complicated queries.

Now, with that as a basis, we should consider that Postgres has implemented CTEs in a peculiar way: they are always materialized (see: PostgreSQL's CTEs are optimisation fences).

That doesn't mean that you should rely on this. When using a CTE, like in the first query, an ORDER BY should be added as well when referencing the CTE, as the ORDER BY inside the CTE may be removed. (I'm not saying that it will always be removed; my tests show that a sort is performed. But it may be removed in a future optimizer change, as an ORDER BY without LIMIT is redundant.)

Update, from Craig Ringer, after a comment I made in his above linked blog post:

So, if you want the results in a specific ordering, add an ORDER BY in the final SELECT. You could get away without it now (and you might get some slight performance gain), but there is no guarantee that this will not change in a future release.

Another reason for explicitly using ORDER BY is that you may not always be the one maintaining the code you write now. Another developer may try to reorganize the query without CTEs, using derived tables or LATERAL joins. They would need to know that your query relies on the order provided by the CTE, so that order needs to be either enforced in the final SELECT or commented/documented somewhere.

Now, another way to write the query would be adding a LIMIT in the CTE and using ROW_NUMBER(). This way, the CTE will only be materialized for the 5100 rows you want and not for all (possibly millions of) rows of the table. Even better if there is an index on (col1, col2). You could even remove the generate_series() and/or remove the CTE.
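The original queries are not reproduced above, but the LIMIT-plus-ROW_NUMBER() pattern can be sketched as follows. This is an illustration against a made-up table t(col1, col2) with 5 rows kept instead of 5100, run here on SQLite for portability (the same SQL is valid on Postgres, where the CTE materialization argument actually applies):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 INTEGER, col2 INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(i % 3, i) for i in range(100)])

# The LIMIT inside the CTE caps how many rows get materialized;
# ROW_NUMBER() records the intended position of each row; and the
# ORDER BY in the final SELECT is what actually guarantees the
# ordering of the result set.
rows = conn.execute("""
    WITH ranked AS (
        SELECT col1, col2,
               ROW_NUMBER() OVER (ORDER BY col1, col2) AS rn
        FROM t
        ORDER BY col1, col2
        LIMIT 5
    )
    SELECT col1, col2, rn
    FROM ranked
    ORDER BY rn
""").fetchall()
print(rows)  # five rows, in (col1, col2) order, numbered 1..5
```

With an index on (col1, col2), the database can satisfy the ORDER BY ... LIMIT in the CTE from the index without sorting the whole table.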