Finding “islands” of holiday periods for employees

gaps-and-islandsoracleoracle-11g-r2

I'm looking for some tips to find a nice solution for my problem:

Let's say I have a table, where days of vacation of employees are stored. e.g.

CREATE TABLE test_dates (
  id             NUMBER        NOT NULL,
  holidaydate    DATE          NOT NULL,
  holidaycomment VARCHAR2(255) NULL
);

INSERT INTO test_dates VALUES (1, To_Date('17.08.2015','dd.mm.rrrr'), 'Trip to Berlin');
INSERT INTO test_dates VALUES (2, To_Date('18.08.2015','dd.mm.rrrr'), 'Trip to Berlin');
INSERT INTO test_dates VALUES (3, To_Date('19.08.2015','dd.mm.rrrr'), 'Trip to Berlin');
INSERT INTO test_dates VALUES (4, To_Date('20.08.2015','dd.mm.rrrr'), 'Trip to Berlin');
INSERT INTO test_dates VALUES (5, To_Date('21.08.2015','dd.mm.rrrr'), 'Stay at home');
INSERT INTO test_dates VALUES (6, To_Date('03.09.2015','dd.mm.rrrr'), 'Trip to London');
INSERT INTO test_dates VALUES (7, To_Date('04.09.2015','dd.mm.rrrr'), 'Trip to London');
INSERT INTO test_dates VALUES (8, To_Date('07.09.2015','dd.mm.rrrr'), 'Trip to London');
INSERT INTO test_dates VALUES (9, To_Date('08.09.2015','dd.mm.rrrr'), 'Trip to London');
INSERT INTO test_dates VALUES (10, To_Date('17.09.2015','dd.mm.rrrr'), NULL);

SELECT * FROM test_dates ORDER BY holidaydate;

Fine. Now I want to query this table to get the whole period of their holidays.

SELECT
  *
FROM (
  SELECT
    connect_by_root(id) AS id,
    connect_by_root(holidaycomment) AS holidaycomment,
    connect_by_root(holidaydate) AS date_from,
    Decode(connect_by_root(holidaydate), holidaydate, NULL, holidaydate) AS date_to,
    CONNECT_BY_ISLEAF AS is_leaf,
    LEVEL AS days
  FROM test_dates
  CONNECT BY PRIOR holidaydate = holidaydate - 1 AND (PRIOR holidaycomment = holidaycomment OR (PRIOR holidaycomment IS NULL AND holidaycomment IS NULL))
)
WHERE is_leaf = 1
ORDER BY date_from;

That query gave me 10 rows, a way too much (August 17th to 20th, August 18th to 20th and so on). So I'm trying to select the starting dates:

SELECT
  id, holidaycomment, date_from, date_to, days
FROM (
  SELECT
    connect_by_root(id) AS id,
    connect_by_root(holidaycomment) AS holidaycomment,
    connect_by_root(holidaydate) AS date_from,
    Decode(connect_by_root(holidaydate), holidaydate, NULL, holidaydate) AS date_to,
    CONNECT_BY_ISLEAF AS is_leaf,
    LEVEL AS days
  FROM test_dates
  -- NEW Code --
  START WITH id IN (
    SELECT t1.id
    FROM test_dates t1
    LEFT JOIN test_dates t2 ON t1.holidaydate - 1 = t2.holidaydate AND (t1.holidaycomment = t2.holidaycomment OR (t1.holidaycomment IS NULL AND t2.holidaycomment IS NULL))
    WHERE t2.id IS NULL
  )
  -- NEW Code --
  CONNECT BY PRIOR holidaydate = holidaydate - 1 AND (PRIOR holidaycomment = holidaycomment OR (PRIOR holidaycomment IS NULL AND holidaycomment IS NULL))
)
WHERE is_leaf = 1
ORDER BY date_from;
ID  HOLIDAYCOMMENT     DATE_FROM     DATE_TO      DAYS
------------------------------------------------------
1   Trip to Berlin     17.08.2015    20.08.2015   4 
5   Stay at home       21.08.2015    NULL         1
6   Trip to London     03.09.2015    04.09.2015   2
8   Trip to London     07.09.2015    08.09.2015   2
10  NULL               17.09.2015    NULL         1

It works, but it seems to be too complicated for me. Is there a better way for select the date_from?

And my last problem: My/The Application stores no information on weekend (See Ids 6 and 8). So I get 2 rows for one holiday.

Anyone an idea to join these rows with a query and no stored function? I'm looking for this result:

ID  HOLIDAYCOMMENT     DATE_FROM     DATE_TO      DAYS
------------------------------------------------------
1   Trip to Berlin     17.08.2015    20.08.2015   4 
5   Stay at home       21.08.2015    NULL         1
6   Trip to London     03.09.2015    08.09.2015   6
10  NULL               17.09.2015    NULL         1

Best Answer

This is my attempt at solving this "groups and islands" problem, with comments in-line:

WITH GroupStart AS
(
    -- Identify the start of each new group
    SELECT 
        TD.id, TD.holidaydate, TD.holidaycomment,
            -- New group if the holidaycomment has changed
            -- (in holidaydate order)
            CASE WHEN holidaycomment =
                    LAG(TD.holidaycomment) OVER (
                        ORDER BY TD.holidaydate)
                THEN 0
                ELSE 1
            END AS gs
    FROM test_dates TD
),
Islands AS
(
    -- The island identifier is the number
    -- of new groups encountered so far
    SELECT 
        id, holidaydate, holidaycomment,  
        SUM(GS.gs) OVER (ORDER BY GS.holidaydate) AS grp
    FROM GroupStart GS
)
-- Results are now a pretty trivial aggregation
SELECT
    MIN(I.id) AS id -- Lowest id in group,
    MIN(I.holidaycomment) AS holidaycomment -- All the same per group,
    MIN(I.holidaydate) AS date_from -- First date in group,
    NULLIF(MAX(I.holidaydate), MIN(I.holidaydate)) AS date_to,
    1 + MAX(I.holidaydate) - MIN(I.holidaydate) AS days
FROM islands I
GROUP BY
    I.grp
ORDER BY
    I.grp,
    date_from;

SQLFiddle

I wrote this as my first Oracle query for about 20 years. The translation above is from my native SQL Server.