PostgreSQL: Merge rows with the same date and time but different numeric values in one row with average of the values

postgresql

I am working on a financial software where we have huge table of ticks with more than 800M rows.

A single row: symbol_id, broker_id, date, time, bid, ask

The problem is the table has many ticks with identical first four columns and different bid and ask.

What is the best way to update the table and combine such rows into one by taking an average of bid and ask columns?

Example:

|1|2|2012-01-01|12:00:00|1.3456|1.3678|

|1|2|2012-01-01|12:00:00|1.3463|1.3612|

Result row:

|1|2|2012-01-01|12:00:00|1.3459|1.3645|

Best Answer

|1|2|2012-01-01|12:00:00|1.3459|1.3645|

SELECT col1, col2, col3, col4, AVG(col5) col5, AVG(col6) col6
FROM table_name
GROUP BY col1, col2, col3, col4;

Related Solutions

PostgreSQL Query – Differentiate Between No Child Rows and No Parent Row in One SELECT

I think I managed to get a satisfactory solution using the following two features:

Subselect bound to a column, which allows me to check if a row exists and (importantly) get a NULL value otherwise (e.g. SELECT (SELECT id FROM a WHERE id = 1) as a_id))
Common Table Expressions

Initial data:

CREATE TABLE people 
(
  id integer not null primary key, 
  name text not null
);

CREATE TABLE thing_types 
(
  id integer not null primary key, 
  name text not null
);

CREATE TABLE things
(
  id integer not null primary key, 
  person_id integer not null references people(id), 
  thing_type_id integer not null references thing_types(id), 
  name text not null
);

INSERT INTO people VALUES (1, 'Bill');
INSERT INTO thing_types VALUES (1, 'game');

INSERT INTO things VALUES (1, 1, 1, 'Duke Nukem');
INSERT INTO things VALUES (2, 1, 1, 'Warcraft 2');

And the query:

WITH v AS (
  SELECT (SELECT id FROM people WHERE id=<person_id_param>) AS person_id, 
         (SELECT id FROM thing_types WHERE id=<thing_type_param>) AS thing_type_id
)
SELECT v.person_id, v.thing_type_id, things.name 
FROM 
  v LEFT JOIN things 
    ON v.person_id = things.person_id AND v.thing_type_id = things.thing_type_id

This query will always return at least one row. In case if both parent table ids are valid and there are some records, I get

person_id  thing_type_id  name
-------------------------------------
        1              1   Duke Nukem
        1              1   Warcraft 2

If either person_id or thing_type_id are invalid, I get one row where name is NULL and either person_id or thing_type_id is NULL:

person_id  thing_type_id  name
-------------------------------------
     NULL              1         NULL

If both person_id and thing_type_id are valid but there are no records in things, I get one row where both person_id and thing_type_id are not NULL, but the name is NULL:

person_id  thing_type_id  name
-------------------------------------
        1              1         NULL

Since I have a NOT NULL constraint on things.name, I know that this case can only mean that there are no matching records in things.

Postgresql – return continuous set of IDs based on two other columns

You can do something like this..

SELECT
  array_agg(id),
  code,
  segment,
  grp,
  count(*)
FROM (
  SELECT
    id,
    code,
    segment,
    count(is_reset) OVER (ORDER BY id) AS grp
  FROM (
    SELECT
      id,
      code,
      segment,
      CASE
        WHEN
          code <> lag(code) OVER (ORDER BY id)
          OR segment <> lag(segment) OVER (ORDER BY id)
        THEN 1
      END AS is_reset
    FROM foo
  ) AS t
) AS g
GROUP BY code, segment, grp;

   array_agg   | code | segment | grp | count 
---------------+------+---------+-----+-------
 {11}          |    1 |       2 |   6 |     1
 {10}          |    0 |       2 |   5 |     1
 {1,2,3}       |    0 |       0 |   0 |     3
 {7}           |    1 |       1 |   2 |     1
 {4,5,6}       |    0 |       1 |   1 |     3
 {16}          |    0 |       3 |   8 |     1
 {8}           |    0 |       1 |   3 |     1
 {12,13,14,15} |    0 |       2 |   7 |     4
 {9}           |    1 |       2 |   4 |     1
(9 rows)

To query this, just search for code = 0 AND count >= 4.

Explanation

First we generate a reset

SELECT
  id,
  code,
  segment,
  CASE
    WHEN
      code <> lag(code) OVER (ORDER BY id)
      OR segment <> lag(segment) OVER (ORDER BY id)
    THEN 1
  END AS is_reset
FROM foo

 id | code | segment | is_reset 
----+------+---------+----------
  1 |    0 |       0 |         
  2 |    0 |       0 |         
  3 |    0 |       0 |         
  4 |    0 |       1 |        1
  5 |    0 |       1 |         
  6 |    0 |       1 |         
  7 |    1 |       1 |        1
  8 |    0 |       1 |        1
  9 |    1 |       2 |        1
 10 |    0 |       2 |        1
 11 |    1 |       2 |        1
 12 |    0 |       2 |        1
 13 |    0 |       2 |         
 14 |    0 |       2 |         
 15 |    0 |       2 |         
 16 |    0 |       3 |        1
(16 rows)

Then we create groups,

  SELECT
    id,
    code,
    segment,
    count(is_reset) OVER (ORDER BY id) AS grp
  FROM (
    SELECT
      id,
      code,
      segment,
      CASE
        WHEN
          code <> lag(code) OVER (ORDER BY id)
          OR segment <> lag(segment) OVER (ORDER BY id)
        THEN 1
      END AS is_reset
    FROM foo
  ) AS t;
 id | code | segment | grp 
----+------+---------+-----
  1 |    0 |       0 |   0
  2 |    0 |       0 |   0
  3 |    0 |       0 |   0
  4 |    0 |       1 |   1
  5 |    0 |       1 |   1
  6 |    0 |       1 |   1
  7 |    1 |       1 |   2
  8 |    0 |       1 |   3
  9 |    1 |       2 |   4
 10 |    0 |       2 |   5
 11 |    1 |       2 |   6
 12 |    0 |       2 |   7
 13 |    0 |       2 |   7
 14 |    0 |       2 |   7
 15 |    0 |       2 |   7
 16 |    0 |       3 |   8
(16 rows)

Now you'll notice that the groups a fn(code,segment), reseting when either changes. Now we group by the grp, and do a count(*), but because you want the ids we run them through array_agg and store them as an array.

Note, you say and the sequence of IDs is not broken. I'm working off the assumption that this is not what you meant. PostgreSQL doesn't have a concept of gapless sequences, so that's not a good idea anyway. I assume you wanted what I've done.

Best Answer

Related Solutions

PostgreSQL Query – Differentiate Between No Child Rows and No Parent Row in One SELECT

Postgresql – return continuous set of IDs based on two other columns

Explanation

Related Question