Postgresql – Find most frequent values for a given column

aggregate database-design date postgresql

I have a table that I would like to use as a leaderboard for invitations, as described below. I would like to create a query that counts the number of duplicate rows in a given month and orders the result in descending order.

Reading through some questions, this query seems to work:

SELECT COUNT(invite_code) AS counted
FROM invite_table
GROUP BY invite_code
ORDER BY counted DESC
LIMIT 10;

But it doesn't consider the month. What I am looking for is to get the most frequently appearing user_code where the month is specified. Any criticism of the table design is also welcome, as I have deliberately designed it so that there are repeating rows with duplicate values. I am trying to track the users whose invite code is used the most in a given month. I also have codes that indicate which channel a user comes from (seeing an ad on FB, for example). Is this a valid table design?

Table "public.invite_table"
   Column    |         Type         | Collation | Nullable | Default
-------------+----------------------+-----------+----------+---------
 user_code   | character varying    |           | not null |
 invite_code | character varying    |           |          |
 month       | character varying(3) |           | not null |
 points      | integer              |           | not null |
Indexes:
    "invite_table_pkey" PRIMARY KEY, btree (user_code)
Foreign-key constraints:
    "invite_table_user_code_fkey" FOREIGN KEY (user_code) REFERENCES user_table(user_code)

Best Answer

get the most frequently appearing user_code where the month is specified

Since user_code is the primary key, that question would be nonsense. There can never be more than one. I assume you meant invite_code?

Just add a WHERE clause. And since the column can be NULL, also consider excluding NULL values:

SELECT invite_code, COUNT(*) AS counted
FROM   invite_table
WHERE  month = 'May'  -- or whatever is stored in your varchar(3) column
AND    invite_code IS NOT NULL -- exclude NULL
GROUP  BY invite_code
ORDER  BY counted DESC, invite_code  -- to break ties in deterministic fashion
LIMIT  10;
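
To see how the NULL filter and the tie-breaker behave, here is a minimal sketch that runs the same query against a handful of made-up rows. The CTE named invite_table and all of its values are assumptions for illustration only; they stand in for the real table.

```sql
-- Hypothetical sample rows; the CTE shadows the real invite_table for this query only.
WITH invite_table (user_code, invite_code, month) AS (
   VALUES ('u1', 'abc', 'May')
        , ('u2', 'abc', 'May')
        , ('u3', 'xyz', 'May')
        , ('u4', NULL , 'May')   -- no invite code: filtered out
        , ('u5', 'xyz', 'Jun')   -- other month: filtered out
   )
SELECT invite_code, COUNT(*) AS counted
FROM   invite_table
WHERE  month = 'May'
AND    invite_code IS NOT NULL
GROUP  BY invite_code
ORDER  BY counted DESC, invite_code
LIMIT  10;

-- invite_code | counted
-- abc         |       2
-- xyz         |       1
```

Without the IS NOT NULL filter, the NULL codes would form a group of their own and could show up in the leaderboard.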

Month, date, timestamp?

A month column as varchar(3) doesn't seem very useful if there can be data for more than a single year. I would use data type date for it. You can format that with to_char() any way you like for presentation. Like:

SELECT to_char(date '2017-12-01', 'Mon');  -- 'Dec'

The column could look like this (also addressing your comment):

...
, inserted_at date DEFAULT CURRENT_DATE
...

The default value is entered when the column is omitted in an INSERT statement.
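
A minimal sketch of that behavior, assuming a hypothetical table invite_demo (not the asker's schema) and PostgreSQL 10 or later for the identity column:

```sql
-- Hypothetical demo table; names are illustrative only.
CREATE TABLE invite_demo (
  invite_id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, user_code   varchar NOT NULL
, invite_code varchar
, inserted_at date NOT NULL DEFAULT CURRENT_DATE
);

-- inserted_at is omitted, so the default (today's date) is stored.
INSERT INTO invite_demo (user_code, invite_code)
VALUES ('u1', 'abc');

-- An explicit value overrides the default.
INSERT INTO invite_demo (user_code, invite_code, inserted_at)
VALUES ('u2', 'abc', '2017-12-24');

-- Format the stored date for presentation only.
SELECT user_code, inserted_at, to_char(inserted_at, 'Mon') AS month_abbr
FROM   invite_demo;
```

The surrogate key invite_id is an assumption added so that the same user_code can appear in more than one row.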

Or, if really only the month is relevant:

... DEFAULT date_trunc('month', now())::date

Or store the complete timestamptz (8 bytes, that's what I would probably do):

...
, inserted_at timestamptz DEFAULT now()
...

Read the manual here and here.

And be aware that date and timestamp values derived from the current time (CURRENT_DATE, now()::date) depend on your session's current time zone setting. Details:

Ignoring timezones altogether in Rails and PostgreSQL
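
With inserted_at stored as timestamptz as suggested above, the leaderboard query for a single month could filter on a half-open range instead of a month abbreviation. This is a sketch under the assumption that invite_table has such a column; the dates are placeholders.

```sql
-- Assumes invite_table has the column inserted_at timestamptz.
SELECT invite_code, COUNT(*) AS counted
FROM   invite_table
WHERE  inserted_at >= timestamptz '2017-12-01'  -- start of the month (inclusive)
AND    inserted_at <  timestamptz '2018-01-01'  -- start of the next month (exclusive)
AND    invite_code IS NOT NULL
GROUP  BY invite_code
ORDER  BY counted DESC, invite_code
LIMIT  10;
```

The two literals are interpreted in the session's time zone, so the month boundaries follow that setting. A range on the bare column also lets a plain index on inserted_at be used, which wrapping the column in to_char() or date_trunc() would not.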

Related Solutions

Notes / description string in another table in a relational database

A coworker was suggesting that it is correct for a table to store 'description' and 'notes' as a foreign key rather than in the table for performance reasons, given that some of the records will contain NULL (or empty string in this case), and presumably will be split across pages.

Possibly, depending on RDBMS, the exact data types, and what the performance goal is.

For SQL Server, which it sounds like you're using, you can use sp_tableoption with large value types out of row to keep a LOB (nvarchar(MAX) in this case) column in the table you want (logically), while pushing the bulk of the data storage onto separate pages (physically).

This will make accessing the LOB data slower, but speed up access to only the base data (probably more so for scans) as the tradeoff. Depending on the average LOB size, this setting may not make a significant difference. You will have to test this for your exact scenario to see if it improves your workload.

In any event, as long as you want to keep the one-to-one relationship, there's no reason to logically separate the values from the base table because there is an available way around the problem in SQL Server, assuming you're using the MAX type variant. (Note: if you're currently using a fixed-length field, you could switch to the MAX type and add a check constraint to limit the length.)
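
As a rough T-SQL sketch of both suggestions: dbo.Item, the Notes column, the constraint name, and the 4000-character limit are all hypothetical and should be adapted to the real schema.

```sql
-- T-SQL (SQL Server). dbo.Item and Notes are hypothetical names.
-- Store nvarchar(max) values off the base data pages.
EXEC sp_tableoption 'dbo.Item', 'large value types out of row', 1;

-- Limit the length of a MAX column declaratively.
-- Note: LEN() ignores trailing spaces.
ALTER TABLE dbo.Item
  ADD CONSTRAINT CK_Item_Notes_Length CHECK (LEN(Notes) <= 4000);
```

As far as I know, switching the option does not rewrite existing values right away; they move when they are next updated, so test on a copy of the data first.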

This conversation occurred after we noticed that the integrity of the one-to-one table design was broken with a duplicate record.

It seems like the design of these structures is smelly. Are you keeping a single table to store all of these fields, which are then referenced from the parent tables?

A better design would be to only separate the tables where there's an actual need (again, as I mentioned, this may not even be required) with a declarative one-to-one relationship, such as repeating the primary key of the parent table as the primary key in the LOB storage table.
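
Such a declarative one-to-one relationship might look like the following sketch. The table and column names are made up for illustration; the data types are chosen for SQL Server.

```sql
-- Hypothetical tables; names are illustrative only.
CREATE TABLE parent (
  parent_id int PRIMARY KEY
, name      nvarchar(100) NOT NULL
);

-- The primary key of the child is also the foreign key to the parent,
-- so each parent row can have at most one row of notes.
CREATE TABLE parent_notes (
  parent_id   int PRIMARY KEY
              REFERENCES parent (parent_id) ON DELETE CASCADE
, description nvarchar(max)
, notes       nvarchar(max)
);
```

A second row in parent_notes for the same parent_id is rejected by the primary key, so no application code is needed to guard against duplicates.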

It seems to me that if the developer who designed the column had put it in the source table, all of the code that handles the one-to-one integrity would be unnecessary (and would not have caused this bug).

Quite possibly. The fewer moving parts there are in a system, the more likely it is to be reliable, probably with fewer bugs as a result of complexity.

Postgresql – Find all values not present for other values of other column

If I interpret your question correctly:

"Pick all rows where a has one of two given values, and b only exists in combination with one of them."

You can use any of the standard techniques laid out in the referenced answer.

Select rows where value of second column is not present in first column

Just restrict your base table to the two given values of a. For instance:

SELECT *
FROM   tbl t1
WHERE  a IN (1,2)
AND    NOT EXISTS (
   SELECT 1
   FROM   tbl t2
   WHERE  a IN (1,2)
   AND    t2.b =  t1.b
   AND    t2.a <> t1.a
   );

This also works with duplicates on (a,b). Multiple identical rows would be returned.
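
A small worked example with made-up rows shows both effects. The CTE named tbl and its values are assumptions for illustration; an ORDER BY is added only to make the output deterministic.

```sql
-- Hypothetical sample rows standing in for the real table tbl.
WITH tbl (a, b) AS (
   VALUES (1, 10)
        , (2, 10)   -- b = 10 occurs with both a = 1 and a = 2: excluded
        , (1, 20)
        , (3, 20)   -- a = 3 is outside the two given values: ignored
        , (2, 30)
        , (2, 30)   -- duplicate of the previous row
   )
SELECT *
FROM   tbl t1
WHERE  a IN (1,2)
AND    NOT EXISTS (
   SELECT 1
   FROM   tbl t2
   WHERE  a IN (1,2)
   AND    t2.b =  t1.b
   AND    t2.a <> t1.a
   )
ORDER  BY a, b;

-- a | b
-- 1 | 20
-- 2 | 30
-- 2 | 30
```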
