Mysql – Aggregate sampler function for programmer express predictions

aggregateMySQLmysql-5.7

Are there any SQL-standard or DBMS implementation that offers "sample aggregator" for predictably repeated columns?

I will explain, see bold text at the end.

PS: this question is about "little performance gain" in big tables, and about "semantic enhancements" in the SQL language, in general.


As typical table, we can imagine a SQL-view v1 of some piece of data:

Organization   | City           |  Country | Info1 | Info2
---            | ---            | ---      | ---   | ---
LocalOrg1      | San Francisco  |  US      | 10    | 23
LocalOrg1      | San Francisco  |  US      | 2     | 24
Armarinhos Fer | São Paulo      |  BR      | 11    | 55
Armarinhos Fer | São Paulo      |  BR      | 12    | 56

My (programmer's) prediction is that all organizations of v1 exists in only one city (eg. there are only one "San Francisco City Hall"). So,on basis of my prediction, the query

SELECT organization, city, country, 
       sum(info1) as tot1, avg(info2) as avg2, count(*) as n
FROM v1
GROUP BY organization, city, country  -- city,country predicted repeat

have some redundancy and some semantic limitation:
the grouping criteria is organization, there is no need to spend CPU checking city and country. And I can say "Hey John look that query grouped by organization", because it does not make sense to say to another human "organization, city and country" when he have the same prediction in mind. It was a SQL obligation but is not the semantic essence of the query.

Of course, the semantic may be better with GROUP by organization, 2,3 syntax sugar, or using max(),

SELECT organization, max(city) as city, max(country) as country, 
       sum(info1) as tot1, avg(info2) as avg2, count(*) as n
FROM v1
GROUP BY organization  -- better semantic, worse performance 

but max() lost time "comparing nothing" (because as I predicted there are only repeated things), and perhaps the query cost will be bigger than the first query.

Some DBMS also offer first() and last() which might be more efficient (!) then max(), and I guess that this is the simplest solution today: use some kind of first() function optimize performance and semantic, in basis of my prediction that city and country repeats when grouping by organization.

But no matter if the sample of a repeated column comes from first or last sampled row. The first/last can be also an internal optimization choice, so, the real need in this context is a kind of aggsample() function:

SELECT organization, 
       aggsample(city), aggsample(country) -- better semantic and perfornace
       sum(info1) as tot1, avg(info2) as avg2, count(*) as n
FROM v1
GROUP BY organization  -- better performance 

So, putting in this detailed context: Are there are any SQL language variation where this kind of function (aggsample) was defined?


EDITED after comments and some homework…

NOTES

Candidates and limitations:

  • PostgreSQL's DISTINCT ON. Seems to address the similar problem, but not solves the use in usual GROUP BY summarizations.

  • MySQL's ANY_VALUE, as suggested by @AndriyM. Seems perfect (!), but I never used… Will try later, install mysql and do some tests.

Similar discussions:

Trying

CREATE TABLE v1 (
  Organization text,  City text,
  Country text, Info1  int, Info2 int
);
INSERT INTO v1 VALUES 
 ('LocalOrg1', 'San Francisco', ' US', '10  ', '23'),
 ('LocalOrg1', 'San Francisco', ' US', '2   ', '24'),
 ('Armarinhos Fer', 'São Paulo', ' BR', '11  ', '55'),
 ('Armarinhos Fer', 'São Paulo', ' BR', '12  ', '56');

Extrange, PostgreSQL offer as valid syntax "DISTINCT ON + GROUP BY",

SELECT DISTINCT ON (organization) organization, city, country, 
       sum(info1) as tot1, avg(info2) as avg2, count(*) as n
FROM v1
GROUP BY organization, city, country

but not works better (worst performance) and is not valid the reduction ou grouping with only GROUP BY organization.

… no other test.

PS: the only valid syntax (and result) with DISTINCT ON of postgresql is

SELECT DISTINCT ON (organization) organization, City, Country, 
    sum(Info1) OVER w AS tot1,
    avg(info2) OVER w AS avg2,
    count(*)   OVER w as n
FROM v1
WINDOW w AS (PARTITION BY organization);

as we see, ugly syntax, and with worst performance.

Best Answer

Thanks @AndriyM, the link to this new features in MySQL 5.7 led me to the ANY_VALUE() function, that is exactly the "aggsample" of my question!

  • Performance conclusion: the use of ANY_VALUE() function represents some gain in performance, demonstrated by thousands of real world use cases of MySQL databases, and by the decision to preserve this advantage in nowadays versions of MySQL, that bring us with this new function.

  • Usability (friendliness) conclusion: ANY VALUE() makes possible semantic gain (takeoff the non-grouping columns) in the GROUP BY clause.

  • Philosophical conclusion: it is way to programmer express his/her predictions about repeated values.

so, ANY VALUE() is a good recommendation for any other SQL language!


NOTES

To test you need MySQL 5.7+, and check SELECT @@GLOBAL.sql_mode or set mode to ONLY_FULL_GROUP_BY.

The query works fine,

SELECT organization, any_value(city) AS city, any_value(country) as country,
       sum(info1) as tot1, avg(info2) as avg2, count(*) as n 
FROM v1 GROUP BY organization;

Another interesting behaviour in MySQL is let user in a friendly mode, where any_value() is not necessary — so, is also possible to avoid "AS name" boring syntax. In this special (non-standard) mode, the following query have the same behaviour,

SELECT organization, city, country,
       sum(info1) as tot1, avg(info2) as avg2, count(*) as n 
FROM v1 GROUP BY organization;