Postgresql – Find the greatest difference between each unique record with different timestamps

greatest-n-per-grouporder-bypostgresqlrankwindow functions

I have a PostgreSQL table with the following schema and data:

CREATE TABLE IF NOT EXISTS T(
     id uuid PRIMARY KEY,
     username varchar(15),
     person varchar(10),
     tweets int,
     followers int,
     following int,
     likes int,
     created_at date) 
;
 
 id                                   | username     | person     | tweets | followers | following | likes | created_at
 :----------------------------------- | :----------- | :--------- | -----: | --------: | --------: | ----: | :---------
 3fa34100-d688-4051-a687-ec49d05e7212 | renok        | null       |    110 |         6 |         0 |     0 | 2020-10-10
 bab9ceb9-2770-49ea-8489-77e5d763a223 | Lydia_C      | test user2 |    515 |      1301 |      1852 |  1677 | 2020-10-10
 4649077a-9188-4821-a1ec-3b38608ea44a | Kingston_Sav | null       |   2730 |      1087 |      1082 |  1339 | 2020-10-10
 eef80836-e140-4adc-9598-8b612ab1825b | TP_s         | null       |   1835 |       998 |       956 |  1832 | 2020-10-10
 fd3ff8c7-0994-40b6-abe0-915368ab9ae5 | DKSnr4       | null       |    580 |       268 |       705 |   703 | 2020-10-10
 3fa34100-d688-4051-a687-ec49d05e7312 | renok        | null       |    119 |         6 |         0 |     0 | 2020-10-12
 bab9ceb9-2770-49ea-8489-77e5d763a224 | Lydia_C      | test user2 |    516 |      1301 |      1852 |  1687 | 2020-10-12
 4649077a-9188-4821-a1ec-3b38608ea44B | Kingston_Sav | null       |   2737 |      1090 |      1084 |  1342 | 2020-10-12
 eef80836-e140-4adc-9598-8b612ae1835c | TP_s         | null       |   1833 |       998 |       957 |  1837 | 2020-10-12
 fd3ff8c7-0994-40b6-abe0-915368ab7ab5 | DKSnr4       | null       |    570 |       268 |       700 |   703 | 2020-10-12

I intend to get the biggest difference between the most recent date and the next most recent date for each unique username and the find the username with the largest margin (difference) for example..In the above table the most recent date is 2020-10-12 and the next most recent date is 2020-10-10.

So I want to get something like this

 id                                   | username     | person | tweets | followers | following | likes | created_at | prev_followers | gain
 :----------------------------------- | :----------- | :----- | -----: | --------: | --------: | ----: | :--------- | -------------: | ---:
 4649077a-9188-4821-a1ec-3b38608ea44a | Kingston_Sav | null   |   2737 |      1090 |      1084 |  1342 | 2020-10-12 |           1087 |    3

Best Answer

Many ways lead to Rome. The below should be a good one (fast and flexible) to "find the username with the largest margin"

Assuming all involved columns are defined NOT NULL. And each username can only have one entry per day. Else you have to do more.

WITH cte AS (
   SELECT *, dense_rank() OVER (ORDER BY created_at DESC) AS rnk
   FROM   tbl
   )
SELECT d1.*
     , d2.followers AS prev_followers
     , d1.followers - d2.followers AS gain
FROM  (SELECT * FROM cte WHERE rnk = 1) d1
JOIN  (SELECT * FROM cte WHERE rnk = 2) d2 USING (username)
ORDER  BY gain DESC
        , d1.followers, username  -- added tiebreaker
LIMIT  1;

The CTE named cte attaches rank numbers with dense_rank() (not rank(), not row_number()). Then join the latest day (rnk = 1) with the one before (rnk = 2) and calculate the gain. Obviously, users must have entries for both days to qualify. Finally order by the gain and take the first row.

Note the added ORDER BY expressions to try and break possible ties: there can be multiple users with the same gain, so you have to define how to deal with that. One way is to add tiebreakes. In my example, a user with a smaller absolute number of followers is preferred (higher relative gain), and the alphabetically first wins if that's still ambiguous.

Or you return all "winners":

Again, many ways ... Postgres 13 added the standard SQL clause WITH TIES for that purpose exactly:

WITH cte AS (
   SELECT *, dense_rank() OVER (ORDER BY created_at DESC) AS rnk
   FROM   tbl
   )
SELECT d1.*
     , d2.followers AS prev_followers
     , d1.followers - d2.followers AS gain
FROM  (SELECT * FROM cte WHERE rnk = 1) d1
JOIN  (SELECT * FROM cte WHERE rnk = 2) d2 USING (username)
ORDER  BY gain DESC
FETCH  FIRST 1 ROWS WITH TIES;

db<>fiddle here

Detailed explanation for WITH TIES:

Greater than or equal to ALL() and equal to MAX() speed

Related Solutions

Postgresql – Difference between return next and return record

While RETURN NEXT is valid plpgsql syntax, ~~RETURN RECORD~~ is non-existent as such. It corresponds to how you declare the return type in the function header. Read the manual about how to return from a function.

In a function declaring RETURNS int you can:

RETURN 1;

For RETURNS SETOF int you can:

RETURN NEXT 1;

If you declare OUT parameters, you can just:

RETURN;
-- or do nothing: OUT params are returned at the end of the function anyway

Or if you combine OUT parameters with RETURNS SETOF record - effectively the same as RETURNS TABLE (...):

RETURN NEXT;

The current state of all OUT parameters is returned as next row.

Finally, to return a whole set of rows at once:

RETURN QUERY ...

record is a special data type. It's meaning depends on where you use it. Like I posted in my answer to your previous question:

Declare variable of table type in PL/pgSQL

The manual about the record type in PL/pgSQL.

Sql-server – the difference between COUNT() and COUNT() OVER()

COUNT(*) Returns all rows ignoring nulls right?

I'm not sure what you mean by "ignoring nulls" here. It returns the number of rows irrespective of any NULLs

SELECT COUNT(*)       
FROM (VALUES (CAST(NULL AS INT)),
             (CAST(NULL AS INT))) V(C)

Returns 2.

Altering the above query to COUNT(C) would return 0 as when using COUNT with an expression other than * only NOT NULL values of that expression are counted.

Suppose the table in your question has the following source data

+---------+---------------+
|  Name   | MaritalStatus |
+---------+---------------+
| Albert  | Single        |
| Bob     | Single        |
| Charles | Single        |
| David   | Single        |
| Edward  | Married       |
| Fred    | Married       |
| George  | NULL          |
+---------+---------------+

The query

SELECT MaritalStatus,
       COUNT(*) AS CountResult
FROM   T
GROUP  BY MaritalStatus

Returns

+---------------+-------------+
| MaritalStatus | CountResult |
+---------------+-------------+
| Single        |           4 |
| Married       |           2 |
| NULL          |           1 |
+---------------+-------------+

Hopefully it is obvious how that result relates to the original data.

What does COUNT(*) OVER() do?

Adding that into the SELECT list for the previous query produces

+---------------+-------------+-----------------+
| MaritalStatus | CountResult | CountOverResult |
+---------------+-------------+-----------------+
| Single        |           4 |               3 |
| Married       |           2 |               3 |
| NULL          |           1 |               3 |
+---------------+-------------+-----------------+

Notice that the result set has 3 rows and CountOverResult is 3. This is not a coincidence.

The reason for this is because it logically operates on the result set after the GROUP BY.

COUNT(*) OVER () is a windowed aggregate. The absence of any PARTITION BY or ORDER BY clause means that the window it operates on is the whole result set.

In the case of the query in your question the value of CountOverResult is the same as the number of distinct MaritalStatus values that exist in the base table because there is one row for each of these in the grouped result.

Best Answer

Related Solutions

Postgresql – Difference between return next and return record

Sql-server – the difference between COUNT(*) and COUNT(*) OVER()

Related Question

Sql-server – the difference between COUNT() and COUNT() OVER()