Are there any SQL-standard or DBMS implementation that offers "sample aggregator" for predictably repeated columns?
I will explain, see bold text at the end.
PS: this question is about "little performance gain" in big tables, and about "semantic enhancements" in the SQL language, in general.
As typical table, we can imagine a SQL-view v1
of some piece of data:
Organization | City | Country | Info1 | Info2
--- | --- | --- | --- | ---
LocalOrg1 | San Francisco | US | 10 | 23
LocalOrg1 | San Francisco | US | 2 | 24
Armarinhos Fer | São Paulo | BR | 11 | 55
Armarinhos Fer | São Paulo | BR | 12 | 56
My (programmer's) prediction is that all organizations of v1
exists in only one city (eg. there are only one "San Francisco City Hall"). So,on basis of my prediction, the query
SELECT organization, city, country,
sum(info1) as tot1, avg(info2) as avg2, count(*) as n
FROM v1
GROUP BY organization, city, country -- city,country predicted repeat
have some redundancy and some semantic limitation:
the grouping criteria is organization
, there is no need to spend CPU checking city
and country
. And I can say "Hey John look that query grouped by organization", because it does not make sense to say to another human "organization, city and country" when he have the same prediction in mind. It was a SQL obligation but is not the semantic essence of the query.
Of course, the semantic may be better with GROUP by organization, 2,3
syntax sugar, or using max()
,
SELECT organization, max(city) as city, max(country) as country,
sum(info1) as tot1, avg(info2) as avg2, count(*) as n
FROM v1
GROUP BY organization -- better semantic, worse performance
but max()
lost time "comparing nothing" (because as I predicted there are only repeated things), and perhaps the query cost will be bigger than the first query.
Some DBMS also offer first()
and last()
which might be more efficient (!) then max()
, and I guess that this is the simplest solution today: use some kind of first()
function optimize performance and semantic, in basis of my prediction that city and country repeats when grouping by organization.
But no matter if the sample of a repeated column comes from first or last sampled row. The first/last can be also an internal optimization choice, so, the real need in this context is a kind of aggsample()
function:
SELECT organization,
aggsample(city), aggsample(country) -- better semantic and perfornace
sum(info1) as tot1, avg(info2) as avg2, count(*) as n
FROM v1
GROUP BY organization -- better performance
So, putting in this detailed context: Are there are any SQL language variation where this kind of function (aggsample
) was defined?
EDITED after comments and some homework…
NOTES
Candidates and limitations:
-
PostgreSQL's
DISTINCT ON
. Seems to address the similar problem, but not solves the use in usual GROUP BY summarizations. -
MySQL's
ANY_VALUE
, as suggested by @AndriyM. Seems perfect (!), but I never used… Will try later, install mysql and do some tests.
Similar discussions:
- https://stackoverflow.com/a/8373384/287948
- https://stackoverflow.com/q/36134657/287948
- … https://stackoverflow.com/a/20347763/287948
Trying
CREATE TABLE v1 (
Organization text, City text,
Country text, Info1 int, Info2 int
);
INSERT INTO v1 VALUES
('LocalOrg1', 'San Francisco', ' US', '10 ', '23'),
('LocalOrg1', 'San Francisco', ' US', '2 ', '24'),
('Armarinhos Fer', 'São Paulo', ' BR', '11 ', '55'),
('Armarinhos Fer', 'São Paulo', ' BR', '12 ', '56');
Extrange, PostgreSQL offer as valid syntax "DISTINCT ON + GROUP BY",
SELECT DISTINCT ON (organization) organization, city, country,
sum(info1) as tot1, avg(info2) as avg2, count(*) as n
FROM v1
GROUP BY organization, city, country
but not works better (worst performance) and is not valid the reduction ou grouping with only GROUP BY organization
.
… no other test.
PS: the only valid syntax (and result) with DISTINCT ON
of postgresql is
SELECT DISTINCT ON (organization) organization, City, Country,
sum(Info1) OVER w AS tot1,
avg(info2) OVER w AS avg2,
count(*) OVER w as n
FROM v1
WINDOW w AS (PARTITION BY organization);
as we see, ugly syntax, and with worst performance.
Best Answer
Thanks @AndriyM, the link to this new features in MySQL 5.7 led me to the
ANY_VALUE()
function, that is exactly the "aggsample" of my question!Performance conclusion: the use of
ANY_VALUE()
function represents some gain in performance, demonstrated by thousands of real world use cases of MySQL databases, and by the decision to preserve this advantage in nowadays versions of MySQL, that bring us with this new function.Usability (friendliness) conclusion:
ANY VALUE()
makes possible semantic gain (takeoff the non-grouping columns) in theGROUP BY
clause.Philosophical conclusion: it is way to programmer express his/her predictions about repeated values.
so,
ANY VALUE()
is a good recommendation for any other SQL language!NOTES
To test you need MySQL 5.7+, and check
SELECT @@GLOBAL.sql_mode
or set mode toONLY_FULL_GROUP_BY
.The query works fine,
Another interesting behaviour in MySQL is let user in a friendly mode, where
any_value()
is not necessary — so, is also possible to avoid "AS name" boring syntax. In this special (non-standard) mode, the following query have the same behaviour,