Postgresql – Enforce Functional Dependency in GROUP BY Clause

group bypostgresql

Let's consider I have the following tables:

    CREATE TABLE users(
        id serial PRIMARY KEY,
        age integer
    )

    CREATE TABLE products(
        id serial PRIMARY KEY,
        sku character varying(255),
        user_id integer REFERENCES users
    )

So basically a product would belong to a user and a user could have many products.

I want the ability to load users with their products doing the aggregation at the database level (Maybe wrongly but I assume it would be easier and probably more efficient than running some aggregation code in my application layer: Postgres comes with nice aggregation functions, why not using them !)

So I would run something like

SELECT "users".*, json_agg("products".*) as "products" 
  FROM "users" LEFT JOIN "products" ON "users"."id" = "products"."user_id" 
  GROUP BY "users"."id"

All good, my driver can parse JSON structures and I have nothing to do on the application layer side.

The problem rises if I want to introduce some sort of pagination:

WITH "users" AS (SELECT * FROM "users" ORDER BY "id" LIMIT 20)
SELECT "users".*, json_agg("products".*) as "products" 
FROM "users" LEFT JOIN "products" ON "users"."id" = "products"."user_id" 
GROUP BY "users"."id"

I have the known error

ERROR: column "users.age" must appear in the GROUP BY clause or be used in an aggregate function

As the "users"."id" is not considered as the primary key of the temporary table created by my subquery

I could fix it by adding every users column in the group by clause. But I find it troublesome and to be a shame as I am sure "users"."id" will define in a unique way an item of my subquery.

So, I would like to know if there is a way to tell the database engine "users"."id" is some sort of primary key for my subquery ?

If not, do you see a better way ?

EDIT: This question is quite similar (3 years old though)

Thanks

Best Answer

create table users (user_id int, user_name text);

insert into users values
(1, 'user1'),(2, 'user2'),(3, 'user3'),(4, 'user4'),(5, 'user5');

create table products (user_id int, id int, name text, att1 text);

insert into products values
(2, 1, 'prod1', 'att1'),
(2, 2, 'prod2', 'att1'),
(2, 3, 'prod3', 'att1'),
(2, 4, 'prod4', 'att1'),
(3, 5, 'prod5', 'att1'),
(3, 6, 'prod6', 'att1');

IMHO your first query should throw the same error message.

select
    users.*,
    json_agg(products.*) as products
from
    users
left join products on
    users.user_id = products.user_id;

ERROR: column "users.user_id" must appear in the GROUP BY clause or be used in an aggregate function LINE 2: users.*,
^

But you can move json_agg to a subquery instead of joining both tables, then you can easily limit the number of returned rows.

select 
    users.*,
    (select 
        json_agg(products.*)
     from
        products
     where
        user_id = users.user_id)  as products
from
    users
limit 3;

+---------+-----------+------------------------------------------------------+
| user_id | user_name | products                                             |
+---------+-----------+------------------------------------------------------+
| 1       | user1     |                                                      |
+---------+-----------+------------------------------------------------------+
| 2       | user2     | [{"user_id":2,"id":1,"name":"prod1","att1":"att1"},  |
|         |           |  {"user_id":2,"id":2,"name":"prod2","att1":"att1"},  |
|         |           |  {"user_id":2,"id":3,"name":"prod3","att1":"att1"},  |
|         |           |  {"user_id":2,"id":4,"name":"prod4","att1":"att1"}]  |
+---------+-----------+------------------------------------------------------+
| 3       | user3     | [{"user_id":3,"id":5,"name":"prod5","att1":"att1"},  |
|         |           |  {"user_id":3,"id":6,"name":"prod6","att1":"att1"}]  |
+---------+-----------+------------------------------------------------------+

db<>fiddle here

pgAdmin timing

When you execute a query from the query tool, the message pane shows something like:

Total query runtime: 62 ms.

And the status line shows the same time. I quote pgAdmin help about that:

The status line will show how long the last query took to complete. If a dataset was returned, not only the elapsed time for server execution is displayed, but also the time to retrieve the data from the server to the Data Output page.

If you want to see the time on the server you need to use SQL EXPLAIN ANALYZE or the built in Shift + F7keyboard shortcut or Query -> Explain analyze. Then, at the bottom of the explain output you get something like this:

Total runtime: 0.269 ms

Mysql – What corner cases exist when relying on undocumented behaviour to determine values selected by MySQL for hidden columns in GROUP BY operations

I was thinking the NATURAL JOIN example you just used

SELECT * FROM my_table NATURAL JOIN (
  SELECT   group_col, MAX(sort_col) sort_col
  FROM     my_table
  GROUP BY group_col
) t

If you shift to another type of JOIN and impose WHERE, ordering can come and go without warning in spite of the ill-advised reliance on undocumented behavior of the GROUP BY.

For this example, I will

use Windows 7
use MySQL 5.5.12-log for Windows
create some sample data
impose a LEFT JOIN without a WHERE clause
impose a LEFT JOIN with a WHERE clause

For the DB Environment

mysql> select version();
+------------+
| version()  |
+------------+
| 5.5.12-log |
+------------+
1 row in set (0.00 sec)

mysql> show variables like '%version_co%';
+-------------------------+------------------------------+
| Variable_name           | Value                        |
+-------------------------+------------------------------+
| version_comment         | MySQL Community Server (GPL) |
| version_compile_machine | x86                          |
| version_compile_os      | Win64                        |
+-------------------------+------------------------------+
3 rows in set (0.00 sec)

mysql>

Using this script to generate sample data

DROP DATABASE IF EXISTS eggyal;
CREATE DATABASE eggyal;
USE eggyal
CREATE TABLE groupby
(
    id int not null auto_increment,
    num int,
    primary key (id)
);
INSERT INTO groupby (num) VALUES
(floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
(floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
(floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
(floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp()));
INSERT INTO groupby (num) SELECT num FROM groupby;
SELECT * FROM groupby;

and these two queries for testing the GROUP BY subsequent use;

SELECT * FROM groupby A LEFT JOIN
(
    SELECT   num, MAX(id) id
    FROM     groupby
    GROUP BY num
) B USING (id);
SELECT * FROM groupby A LEFT JOIN
(
    SELECT   num, MAX(id) id
    FROM     groupby
    GROUP BY num
) B USING (id) WHERE B.num IS NOT NULL;

Let's test the durability of the GROUP BY's results;

STEP 01 : Create the Sample Data

mysql> DROP DATABASE IF EXISTS eggyal;
Query OK, 1 row affected (0.09 sec)

mysql> CREATE DATABASE eggyal;
Query OK, 1 row affected (0.00 sec)

mysql> USE eggyal
Database changed
mysql> CREATE TABLE groupby
    -> (
    ->     id int not null auto_increment,
    ->     num int,
    ->     primary key (id)
    -> );
Query OK, 0 rows affected (0.07 sec)

mysql> INSERT INTO groupby (num) VALUES
    -> (floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
    -> (floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
    -> (floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
    -> (floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp()));
Query OK, 8 rows affected (0.06 sec)
Records: 8  Duplicates: 0  Warnings: 0

mysql> INSERT INTO groupby (num) SELECT num FROM groupby;
Query OK, 8 rows affected (0.05 sec)
Records: 8  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM groupby;
+----+------------+
| id | num        |
+----+------------+
|  1 |  269529129 |
|  2 |  387090406 |
|  3 | 1126864683 |
|  4 |  411160755 |
|  5 |   29173595 |
|  6 |  266349579 |
|  7 | 1244227156 |
|  8 |    6231766 |
|  9 |  269529129 |
| 10 |  387090406 |
| 11 | 1126864683 |
| 12 |  411160755 |
| 13 |   29173595 |
| 14 |  266349579 |
| 15 | 1244227156 |
| 16 |    6231766 |
+----+------------+
16 rows in set (0.00 sec)

STEP 02 : Use `LEFT JOIN` without a `WHERE` clause

mysql> SELECT * FROM groupby A LEFT JOIN
    -> (
    ->     SELECT   num, MAX(id) id
    ->     FROM     groupby
    ->     GROUP BY num
    -> ) B USING (id);
+----+------------+------------+
| id | num        | num        |
+----+------------+------------+
|  1 |  269529129 |       NULL |
|  2 |  387090406 |       NULL |
|  3 | 1126864683 |       NULL |
|  4 |  411160755 |       NULL |
|  5 |   29173595 |       NULL |
|  6 |  266349579 |       NULL |
|  7 | 1244227156 |       NULL |
|  8 |    6231766 |       NULL |
|  9 |  269529129 |  269529129 |
| 10 |  387090406 |  387090406 |
| 11 | 1126864683 | 1126864683 |
| 12 |  411160755 |  411160755 |
| 13 |   29173595 |   29173595 |
| 14 |  266349579 |  266349579 |
| 15 | 1244227156 | 1244227156 |
| 16 |    6231766 |    6231766 |
+----+------------+------------+
16 rows in set (0.00 sec)

mysql>

STEP 03 : Use `LEFT JOIN` with a `WHERE` clause

mysql> SELECT * FROM groupby A LEFT JOIN
    -> (
    ->     SELECT   num, MAX(id) id
    ->     FROM     groupby
    ->     GROUP BY num
    -> ) B USING (id) WHERE B.num IS NOT NULL;
+----+------------+------------+
| id | num        | num        |
+----+------------+------------+
| 16 |    6231766 |    6231766 |
| 13 |   29173595 |   29173595 |
| 14 |  266349579 |  266349579 |
|  9 |  269529129 |  269529129 |
| 10 |  387090406 |  387090406 |
| 12 |  411160755 |  411160755 |
| 11 | 1126864683 | 1126864683 |
| 15 | 1244227156 | 1244227156 |
+----+------------+------------+
8 rows in set (0.00 sec)

mysql>

ANALYSIS

Looking at the aforementioned results, here are two questions:

Why does a LEFT JOIN keep an ordering by id ?
Why in the world did using a WHERE impose a reordering ?
- Was it during the JOIN phase ?
- Did the Query Optimizer look ahead at the ordering of the subquery or ignore it ?

No one foresaw any of these effects because the behavior of explicit clauses was relied upon by the implicit behavior of the Query Optimizer.

CONCLUSION

From my perspective, corner cases can only be of an external nature. In light of this, developers must be willing to fully evaluate the results of a GROUP BY in conjunction with the following twelve(12) aspects:

aggregate functions
subquery usage
JOINs clauses
WHERE clauses
sort order of results with no explicit ORDER BY clause
query results using older GA releases of MySQL
query results using newer beta releases of MySQL
the current SQL_MODE setting in my.cnf
the operating system the code was compiled for
possibly the size of join_buffer_size with respect to its effect on the Query Optimizer
possibly the size of sort_buffer_size with respect to its effect on the Query Optimizer
possibly the storage engine being used (MyISAM vs InnoDB)

Here is the key thing to remember : Any instance of MySQL that works for your query in a specific environment is itself a corner case. Once you change one or more of the twelve(12) evaluation aspects, the corner case is due to break, especially given the first nine(9) aspects.

Best Answer

Related Solutions

Postgresql – Postgres multiple joins slow query, how to store default child record