Mysql – Aggregate sampler function for programmer express predictions

aggregateMySQLmysql-5.7

Are there any SQL-standard or DBMS implementation that offers "sample aggregator" for predictably repeated columns?

I will explain, see bold text at the end.

PS: this question is about "little performance gain" in big tables, and about "semantic enhancements" in the SQL language, in general.

As typical table, we can imagine a SQL-view v1 of some piece of data:

Organization   | City           |  Country | Info1 | Info2
---            | ---            | ---      | ---   | ---
LocalOrg1      | San Francisco  |  US      | 10    | 23
LocalOrg1      | San Francisco  |  US      | 2     | 24
Armarinhos Fer | São Paulo      |  BR      | 11    | 55
Armarinhos Fer | São Paulo      |  BR      | 12    | 56

My (programmer's) prediction is that all organizations of v1 exists in only one city (eg. there are only one "San Francisco City Hall"). So,on basis of my prediction, the query

SELECT organization, city, country, 
       sum(info1) as tot1, avg(info2) as avg2, count(*) as n
FROM v1
GROUP BY organization, city, country  -- city,country predicted repeat

have some redundancy and some semantic limitation:
the grouping criteria is organization, there is no need to spend CPU checking city and country. And I can say "Hey John look that query grouped by organization", because it does not make sense to say to another human "organization, city and country" when he have the same prediction in mind. It was a SQL obligation but is not the semantic essence of the query.

Of course, the semantic may be better with GROUP by organization, 2,3 syntax sugar, or using max(),

SELECT organization, max(city) as city, max(country) as country, 
       sum(info1) as tot1, avg(info2) as avg2, count(*) as n
FROM v1
GROUP BY organization  -- better semantic, worse performance

but max() lost time "comparing nothing" (because as I predicted there are only repeated things), and perhaps the query cost will be bigger than the first query.

Some DBMS also offer first() and last() which might be more efficient (!) then max(), and I guess that this is the simplest solution today: use some kind of first() function optimize performance and semantic, in basis of my prediction that city and country repeats when grouping by organization.

But no matter if the sample of a repeated column comes from first or last sampled row. The first/last can be also an internal optimization choice, so, the real need in this context is a kind of aggsample() function:

SELECT organization, 
       aggsample(city), aggsample(country) -- better semantic and perfornace
       sum(info1) as tot1, avg(info2) as avg2, count(*) as n
FROM v1
GROUP BY organization  -- better performance

So, putting in this detailed context: Are there are any SQL language variation where this kind of function (aggsample) was defined?

EDITED after comments and some homework…

NOTES

Candidates and limitations:

PostgreSQL's DISTINCT ON. Seems to address the similar problem, but not solves the use in usual GROUP BY summarizations.
MySQL's ANY_VALUE, as suggested by @AndriyM. Seems perfect (!), but I never used… Will try later, install mysql and do some tests.

Similar discussions:

Trying

CREATE TABLE v1 (
  Organization text,  City text,
  Country text, Info1  int, Info2 int
);
INSERT INTO v1 VALUES 
 ('LocalOrg1', 'San Francisco', ' US', '10  ', '23'),
 ('LocalOrg1', 'San Francisco', ' US', '2   ', '24'),
 ('Armarinhos Fer', 'São Paulo', ' BR', '11  ', '55'),
 ('Armarinhos Fer', 'São Paulo', ' BR', '12  ', '56');

Extrange, PostgreSQL offer as valid syntax "DISTINCT ON + GROUP BY",

SELECT DISTINCT ON (organization) organization, city, country, 
       sum(info1) as tot1, avg(info2) as avg2, count(*) as n
FROM v1
GROUP BY organization, city, country

but not works better (worst performance) and is not valid the reduction ou grouping with only GROUP BY organization.

… no other test.

PS: the only valid syntax (and result) with DISTINCT ON of postgresql is

SELECT DISTINCT ON (organization) organization, City, Country, 
    sum(Info1) OVER w AS tot1,
    avg(info2) OVER w AS avg2,
    count(*)   OVER w as n
FROM v1
WINDOW w AS (PARTITION BY organization);

as we see, ugly syntax, and with worst performance.

Best Answer

Thanks @AndriyM, the link to this new features in MySQL 5.7 led me to the ANY_VALUE() function, that is exactly the "aggsample" of my question!

Performance conclusion: the use of ANY_VALUE() function represents some gain in performance, demonstrated by thousands of real world use cases of MySQL databases, and by the decision to preserve this advantage in nowadays versions of MySQL, that bring us with this new function.
Usability (friendliness) conclusion: ANY VALUE() makes possible semantic gain (takeoff the non-grouping columns) in the GROUP BY clause.
Philosophical conclusion: it is way to programmer express his/her predictions about repeated values.

so, ANY VALUE() is a good recommendation for any other SQL language!

NOTES

To test you need MySQL 5.7+, and check SELECT @@GLOBAL.sql_mode or set mode to ONLY_FULL_GROUP_BY.

The query works fine,

SELECT organization, any_value(city) AS city, any_value(country) as country,
       sum(info1) as tot1, avg(info2) as avg2, count(*) as n 
FROM v1 GROUP BY organization;

Another interesting behaviour in MySQL is let user in a friendly mode, where any_value() is not necessary — so, is also possible to avoid "AS name" boring syntax. In this special (non-standard) mode, the following query have the same behaviour,

SELECT organization, city, country,
       sum(info1) as tot1, avg(info2) as avg2, count(*) as n 
FROM v1 GROUP BY organization;

Related Solutions

Oracle 11g – Aggregate Function and Query

Simple, but "dirty":

with t as (
  select 1 as id, 'Y' as received from dual union
  select 1 as id, 'N' as received from dual union
  select 2 as id, 'N' as received from dual union
  select 2 as id, 'N' as received from dual union
  select 2 as id, 'Y' as received from dual union
  select 3 as id, 'N' as received from dual union
  select 3 as id, 'Y' as received from dual 
)
select min(max(received)) from t group by id;

M
-
Y

Change input (e.g. change id to 4 in last line):

with t as (
  select 1 as id, 'Y' as received from dual union
  select 1 as id, 'N' as received from dual union
  select 2 as id, 'N' as received from dual union
  select 2 as id, 'N' as received from dual union
  select 2 as id, 'Y' as received from dual union
  select 3 as id, 'N' as received from dual union
  select 4 as id, 'Y' as received from dual 
)
select min(max(received)) from t group by id;

M
-
N

Mysql – group by clause without aggregate function

This:

SELECT users.* FROM users
INNER JOIN timesheets ON timesheets.user_id = users.id
WHERE (timesheets.submitted_at <= '2010-07-06 15:27:05.117700')
GROUP BY users.id

Finds all users who have a timesheet submitted on or before the given date. It's equivalent to:

SELECT DISTINCT users.* FROM users
INNER JOIN timesheets ON timesheets.user_id = users.id
WHERE (timesheets.submitted_at <= '2010-07-06 15:27:05.117700');

or:

SELECT  users.*
FROM users
WHERE EXISTS (
    SELECT 1
    FROM timesheets 
    WHERE timesheets.user_id = users.id
    AND timesheets.submitted_at <= '2010-07-06 15:27:05.117700'
);

It works because users.id is the primary key, so all other fields of users are functionally dependent on it. PostgreSQL knows that you don't have to use an aggregate to guarantee a single unambiguous result for each field in a row because there can only be one candidate users.name or whatever for any given users.id row.

(Older PostgreSQL versions didn't know how to identify functional dependencies of the primary key and and would throw an ERROR about needing to use an aggregate or include the field in the GROUP BY here).

NOTES

Trying

Best Answer

NOTES

Related Solutions

Oracle 11g – Aggregate Function and Query

Mysql – group by clause without aggregate function

Related Question