PostgreSQL – How to Assign Group IDs Depending on Content

gaps-and-islands, postgresql, window functions

I have the following table with data as shown below. The linenr values are monotonically increasing but not necessarily consecutive. When the key field contains an ellipsis ('...'), that indicates an entry that is continued from the line above:

create table source (
  linenr    integer unique not null,
  key       text not null,
  value     text );

insert into source values
  (  2, 'tags',  'a'          ),
  (  3, '...',   'b'          ),
  (  4, 'title', 'The Title'  ),
  (  5, 'note',  'this is'    ),
  (  6, '...',   'an EXAMPLE' ),
  (  8, 'title', 'over'       ),
  (  9, '...',   'three'      ),
  ( 10, '...',   'lines'      ),
  ( 11, 'about', 'grouping'   );

Now I want to have a view where a group number is assigned based on the content of the key field. Group numbers do not have to be consecutive, but they should be distinct for each group of rows that starts with a key other than '...' and continues through all following lines where key is '...', like this:

╔════════╤═══════╤═══════╤════════════╗
║ linenr │ group │  key  │   value    ║
╠════════╪═══════╪═══════╪════════════╣
║      2 │     1 │ tags  │ a          ║
║      3 │     1 │ ...   │ b          ║
║      4 │     2 │ title │ The Title  ║
║      5 │     3 │ note  │ this is    ║
║      6 │     3 │ ...   │ an EXAMPLE ║
║      8 │     4 │ title │ over       ║
║      9 │     4 │ ...   │ three      ║
║     10 │     4 │ ...   │ lines      ║
║     11 │     5 │ about │ grouping   ║
╚════════╧═══════╧═══════╧════════════╝

I tried to do this with windows / partitions and the tabibitosan pattern, but haven't been able to come up with anything that works. Looking at the preceding row with lag() doesn't help either when there are multiple consecutive lines with '...'. In a spreadsheet this is extremely easy to do, but in SQL I seemingly cannot refer to the previous row of the current query, can I?

Best Answer

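You can get it using a simple CASE expression that flags the first row of each group: rst is 1 when key is not '...' and null otherwise.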

select linenr, key, value,
       case when key <> '...' then 1 end as rst
from   source;

linenr | key   | value      |  rst
-----: | :---- | :--------- | ---:
     2 | tags  | a          |    1
     3 | ...   | b          | null
     4 | title | The Title  |    1
     5 | note  | this is    |    1
     6 | ...   | an EXAMPLE | null
     8 | title | over       |    1
     9 | ...   | three      | null
    10 | ...   | lines      | null
    11 | about | grouping   |    1

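Now a running SUM(rst) over the rows ordered by linenr does the trick: SUM ignores null values, so the total only increases on rows that start a new group.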

select linenr, key, value,
       sum(rst) over (order by linenr) grp
from (
      select linenr, key, value,
             case when key <> '...' then 1 end as rst
      from   source
     ) x;

linenr | key   | value      | grp
-----: | :---- | :--------- | --:
     2 | tags  | a          |   1
     3 | ...   | b          |   1
     4 | title | The Title  |   2
     5 | note  | this is    |   3
     6 | ...   | an EXAMPLE |   3
     8 | title | over       |   4
     9 | ...   | three      |   4
    10 | ...   | lines      |   4
    11 | about | grouping   |   5
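
The two steps above can also be folded into a single SELECT. The following is only a sketch under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs anywhere (window functions need SQLite 3.25 or later), and it adds an `else 0` branch, which is not in the original answer, so that leading continuation rows would get group 0 instead of null. The SQL text itself should also be valid in PostgreSQL.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table source ("
    " linenr integer unique not null, key text not null, value text)"
)
conn.executemany(
    "insert into source values (?, ?, ?)",
    [
        (2, "tags", "a"), (3, "...", "b"), (4, "title", "The Title"),
        (5, "note", "this is"), (6, "...", "an EXAMPLE"), (8, "title", "over"),
        (9, "...", "three"), (10, "...", "lines"), (11, "about", "grouping"),
    ],
)

# The CASE flags every row that starts a group; the running SUM ordered by
# linenr turns those flags into group numbers without a subquery.
query = """
select linenr,
       sum(case when key <> '...' then 1 else 0 end)
           over (order by linenr) as grp,
       key,
       value
from   source
order  by linenr
"""
rows = conn.execute(query).fetchall()
groups = [grp for _linenr, grp, _key, _value in rows]

for row in rows:
    print(row)
```

Because linenr is unique, the default window frame (everything up to the current row) gives each row the count of group starts seen so far, which is exactly the group number.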

Related Solutions

Single Table vs Multiple Table Design for Distinct Records Queried Together in PostgreSQL

I wouldn't put all of them in one table. Even if you query them together, your application may later need to allow multiple images per post, multiple events per post, or multiple posts per event.

In that case you would be able to save space. JSON may be a good idea, but with regard to indexes it would be better to use multiple tables, because you can then define indexes that may speed up your queries.

Another idea could be to use permutation, which means you have one global table that holds many different types with the same structure. For example:

CREATE TABLE types(
    id int,
    type varchar(50),       -- PostgreSQL has no nvarchar; varchar is the equivalent
    property1 int,
    property2 varchar(50)
);

This way you can store a row with type = 'Image' that holds an order number in property1 and a path in property2. You can also store an event as a row with type = 'Event' that holds the date in property1 and the name in property2. This is just an example.

How to Merge Similar Records with Different Validity Dates in SQL Server

If this is a table of back-to-back ranges only, your case can be treated as a classic "gaps and islands" problem, where you just need to isolate islands of consecutive ranges and then "condense" them by taking the minimum [from] and the maximum [to] per island.

There is an established method of solving this using two ROW_NUMBER calls:

WITH islands AS
(
  SELECT
    id,
    data,
    [from],
    [to],
    island = ROW_NUMBER() OVER (PARTITION BY id       ORDER BY [from])
           - ROW_NUMBER() OVER (PARTITION BY id, data ORDER BY [from])
  FROM
    #mergeTest
)
SELECT
  id,
  data,
  [from] = MIN([from]),
  [to]   = MAX([to])
FROM
  islands
GROUP BY
  id,
  data,
  island
;

This query works in versions as old as SQL Server 2005.
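
To see why the difference of the two ROW_NUMBER values identifies an island, it may help to run the same pattern on a few rows. This is a minimal sketch, not the original poster's data: the table name merge_test, the columns valid_from and valid_to (renamed to avoid the reserved words from and to), and the integer "dates" are all hypothetical, and Python's built-in sqlite3 stands in for SQL Server.

```python
import sqlite3

# Assumption: SQLite (3.25+) stands in for SQL Server; all names and rows
# below are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table merge_test ("
    " id integer, data text, valid_from integer, valid_to integer)"
)
conn.executemany(
    "insert into merge_test values (?, ?, ?, ?)",
    [
        (1, "A", 1, 2),  # first A island starts
        (1, "A", 2, 3),  # same island, back-to-back range
        (1, "B", 3, 4),  # data changes: B island
        (1, "A", 4, 5),  # A again, but a new island
    ],
)

# Within one id, the first ROW_NUMBER counts all rows and the second counts
# rows per data value. Their difference stays constant while data repeats
# on consecutive rows and changes once another value has intervened.
query = """
with islands as (
  select id, data, valid_from, valid_to,
         row_number() over (partition by id       order by valid_from)
       - row_number() over (partition by id, data order by valid_from) as island
  from merge_test
)
select id, data,
       min(valid_from) as merged_from,
       max(valid_to)   as merged_to
from   islands
group  by id, data, island
order  by id, merged_from
"""
merged = conn.execute(query).fetchall()

for row in merged:
    print(row)
```

The two leading A rows collapse into one range from 1 to 3, while the later A row stays separate because the B row in between shifts its row-number difference.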
