PostgreSQL – Grouping in long tables

Tags: group-by, index, json, postgresql

My application is running on PostgreSQL 9.4 and I've hit a problem that I don't
know how to fix.

I'm running a VoIP application, and we have a few tables that are quite long (>60M
rows) with all the call details. I'm trying to build some reports over them, but
some queries are too slow and I don't know how to make them faster.

For example, one customer may have around 700K calls per month, and I want to do
some grouping over them: get the total cost, filter by JSON fields,
etc.

For that we use indexes, but in this case the indexes are heavy: as I said, there
are a lot of similar records (same account_name), so the indexes end up large.

So the system that I use for billing has the following tables:

CREATE TABLE cdrs_primary (
  id SERIAL PRIMARY KEY,
  cgrid CHAR(40) NOT NULL,
  tor  VARCHAR(16) NOT NULL,
  accid VARCHAR(64) NOT NULL,
  cdrhost VARCHAR(64) NOT NULL,
  cdrsource VARCHAR(64) NOT NULL,
  reqtype VARCHAR(24) NOT NULL,
  direction VARCHAR(8) NOT NULL,
  tenant VARCHAR(64) NOT NULL,
  category VARCHAR(32) NOT NULL,
  account VARCHAR(128) NOT NULL,
  subject VARCHAR(128) NOT NULL,
  destination VARCHAR(128) NOT NULL,
  setup_time TIMESTAMP NOT NULL,
  pdd NUMERIC(12,9) NOT NULL,
  answer_time TIMESTAMP NOT NULL,
  usage NUMERIC(30,9) NOT NULL,
  supplier VARCHAR(128) NOT NULL,
  disconnect_cause VARCHAR(64) NOT NULL,
  created_at TIMESTAMP,
  deleted_at TIMESTAMP,
  UNIQUE (cgrid)
);
CREATE INDEX answer_time_idx ON cdrs_primary (answer_time);
CREATE INDEX deleted_at_cp_idx ON cdrs_primary (deleted_at);


CREATE TABLE cdrs_extra (
  id SERIAL PRIMARY KEY,
  cgrid CHAR(40) NOT NULL,
  extra_fields jsonb NOT NULL,
  created_at TIMESTAMP,
  deleted_at TIMESTAMP,
  UNIQUE (cgrid)
);
CREATE INDEX deleted_at_ce_idx ON cdrs_extra (deleted_at);

CREATE TABLE cost_details (
  id SERIAL PRIMARY KEY,
  cgrid CHAR(40) NOT NULL,
  runid  VARCHAR(64) NOT NULL,
  tor  VARCHAR(16) NOT NULL,
  direction VARCHAR(8) NOT NULL,
  tenant VARCHAR(128) NOT NULL,
  category VARCHAR(32) NOT NULL,
  account VARCHAR(128) NOT NULL,
  subject VARCHAR(128) NOT NULL,
  destination VARCHAR(128) NOT NULL,
  cost NUMERIC(20,4) NOT NULL,
  timespans jsonb,
  cost_source VARCHAR(64) NOT NULL,
  created_at TIMESTAMP,
  updated_at TIMESTAMP,
  deleted_at TIMESTAMP,
  UNIQUE (cgrid, runid)
);
CREATE INDEX deleted_at_cd_idx ON cost_details (deleted_at);
CREATE TABLE rated_cdrs (
  id SERIAL PRIMARY KEY,
  cgrid CHAR(40) NOT NULL,
  runid  VARCHAR(64) NOT NULL,
  reqtype VARCHAR(24) NOT NULL,
  direction VARCHAR(8) NOT NULL,
  tenant VARCHAR(64) NOT NULL,
  category VARCHAR(32) NOT NULL,
  account VARCHAR(128) NOT NULL,
  subject VARCHAR(128) NOT NULL,
  destination VARCHAR(128) NOT NULL,
  setup_time TIMESTAMP NOT NULL,
  pdd NUMERIC(12,9) NOT NULL,
  answer_time TIMESTAMP NOT NULL,
  usage NUMERIC(30,9) NOT NULL,
  supplier VARCHAR(128) NOT NULL,
  disconnect_cause VARCHAR(64) NOT NULL,
  cost NUMERIC(20,4) DEFAULT NULL,
  extra_info text,
  created_at TIMESTAMP,
  updated_at TIMESTAMP,
  deleted_at TIMESTAMP,
  UNIQUE (cgrid, runid)
);
CREATE INDEX deleted_at_rc_idx ON rated_cdrs (deleted_at);

In this case account is the customer name or id. My idea is to build a query that returns the following information:

  • cdrs_primary.account
  • count(cdrs_primary.cdrhost)
  • sum(rated_cdrs.cost)
  • sum(rated_cdrs.usage)
  • direction

And this needs to be filtered by the following:
– cdrs_extra.extra_fields['connection']
– cost_details.timespans[0]['MatchedDestId']
– answer_time

My idea is to use the following query:

select
    p.account,
    p.direction,
    date_trunc('day', r.answer_time),
    count(p.cdrhost) as host,
    sum(r.cost) as cost,
    sum(r.usage)/60 as minutes
from cdrs_primary p
join rated_cdrs r on p.cgrid = r.cgrid
join cost_details c on (r.cgrid=c.cgrid and r.runid=c.runid)
join cdrs_extra e on (e.cgrid=r.cgrid)
where
    r.setup_time > '2015-08-15'
    and e.extra_fields->>'connection' = '1'
    and c.timespans->0->>'MatchedDestId' IN ('UKN_FM1', 'UKN_FM2')
    and p.account='eloy'
group by
    p.account, p.direction, date_trunc('day', r.answer_time)

Here is the EXPLAIN (ANALYZE, BUFFERS) output:

─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 HashAggregate  (cost=402230.95..402230.99 rows=2 width=40) (actual time=23854.251..23854.251 rows=0 loops=1)
   Group Key: p.account, p.direction, date_trunc('day'::text, r.answer_time)
   Buffers: shared hit=24645 read=151172
   ->  Nested Loop  (cost=317806.74..402230.92 rows=2 width=40) (actual time=23854.247..23854.247 rows=0 loops=1)
         Join Filter: (p.cgrid = c.cgrid)
         Buffers: shared hit=24645 read=151172
         ->  Nested Loop  (cost=317806.18..395160.27 rows=249 width=170) (actual time=23854.246..23854.246 rows=0 loops=1)
               Buffers: shared hit=24645 read=151172
               ->  Merge Join  (cost=317805.62..320484.41 rows=2643 width=106) (actual time=23854.244..23854.244 rows=0 loops=1)
                     Merge Cond: (p.cgrid = e.cgrid)
                     Buffers: shared hit=24645 read=151172
                     ->  Sort  (cost=196879.28..198155.65 rows=510548 width=65) (actual time=19980.899..20121.328 rows=518651 loops=1)
                           Sort Key: p.cgrid
                           Sort Method: quicksort  Memory: 85224kB
                           Buffers: shared hit=1 read=116096
                           ->  Bitmap Heap Scan on cdrs_primary p  (cost=11509.18..148475.03 rows=510548 width=65) (actual time=319.521..15620.892 rows=518651 loops=1)
                                 Recheck Cond: ((account)::text = 'eloy'::text)
                                 Heap Blocks: exact=114107
                                 Buffers: shared hit=1 read=116096
                                 ->  Bitmap Index Scan on cdrs_primary_account  (cost=0.00..11381.54 rows=510548 width=0) (actual time=270.386..270.386 rows=518651 loops=1)
                                       Index Cond: ((account)::text = 'eloy'::text)
                                       Buffers: shared read=1990
                     ->  Sort  (cost=120926.33..120976.15 rows=19928 width=41) (actual time=3247.967..3284.225 rows=144073 loops=1)
                           Sort Key: e.cgrid
                           Sort Method: quicksort  Memory: 17400kB
                           Buffers: shared hit=24644 read=35076
                           ->  Seq Scan on cdrs_extra e  (cost=0.00..119503.22 rows=19928 width=41) (actual time=2275.996..2358.434 rows=144073 loops=1)
                                 Filter: ((extra_fields ->> 'connection'::text) = '1'::text)
                                 Rows Removed by Filter: 3841475
                                 Buffers: shared hit=24644 read=35076
               ->  Index Scan using rated_cdrs_runid_cgrid_idx on rated_cdrs r  (cost=0.56..28.24 rows=1 width=64) (never executed)
                     Index Cond: (cgrid = p.cgrid)
                     Filter: (setup_time > '2015-08-15 00:00:00'::timestamp without time zone)
         ->  Index Scan using cost_details_cgrid_runid on cost_details c  (cost=0.56..28.38 rows=1 width=48) (never executed)
               Index Cond: ((cgrid = r.cgrid) AND ((runid)::text = (r.runid)::text))
               Filter: (((timespans -> 0) ->> 'MatchedDestId'::text) = ANY ('{UKN_FM1,UKN_FM2}'::text[]))
 Planning time: 4.432 ms
 Execution time: 23863.701 ms

I added the following indexes, but I don't think they help at all; the query is
still slow.

cdrs_extra

"cdrs_extra_connection_idx" gin ((extra_fields -> 'connection'::text))

rated_cdrs:

"rated_cdrs_pkey" PRIMARY KEY, btree (id)
"cost_idx" btree (setup_time)
"deleted_at_rc_idx" btree (deleted_at)
"rated_cdrs_runid_cgrid_idx" btree (cgrid, runid)
"setup_time_idx" btree (setup_time DESC)

cost_details:

"cost_details_cgrid_runid" btree (cgrid, runid)
"cost_details_destination_key_idx" gin (((timespans -> 0) -> 'MatchDestId'::text))

The sizes of the indexes are the following:

  tablename   │            indexname             │  num_rows   │ table_size │ index_size
──────────────┼──────────────────────────────────┼─────────────┼────────────┼────────────
 cdrs_extra   │ cdrs_extra_connection_idx        │ 3.98555e+06 │ 467 MB     │ 4232 kB
 cdrs_extra   │ cdrs_extra_pkey                  │ 3.98555e+06 │ 467 MB     │ 88 MB
 cdrs_extra   │ deleted_at_ce_idx                │ 3.98555e+06 │ 467 MB     │ 159 MB
 cdrs_primary │ answer_time_idx                  │ 3.98555e+06 │ 1020 MB    │ 154 MB
 cdrs_primary │ cdrs_primary_account             │ 3.98555e+06 │ 1020 MB    │ 115 MB
 cdrs_primary │ cdrs_primary_pkey                │ 3.98555e+06 │ 1020 MB    │ 85 MB
 cdrs_primary │ deleted_at_cp_idx                │ 3.98555e+06 │ 1020 MB    │ 85 MB
 cost_details │ cost_details_cgrid_runid         │ 6.80088e+06 │ 7320 MB    │ 471 MB
 cost_details │ cost_details_destination_key_idx │ 6.80088e+06 │ 7320 MB    │ 8056 kB
 cost_details │ cost_details_pkey                │ 6.80088e+06 │ 7320 MB    │ 146 MB
 cost_details │ deleted_at_cd_idx                │ 6.80088e+06 │ 7320 MB    │ 146 MB
 rated_cdrs   │ cost_idx                         │ 7.74959e+06 │ 1595 MB    │ 166 MB
 rated_cdrs   │ deleted_at_rc_idx                │ 7.74959e+06 │ 1595 MB    │ 166 MB
 rated_cdrs   │ rated_cdrs_pkey                  │ 7.74959e+06 │ 1595 MB    │ 166 MB
 rated_cdrs   │ rated_cdrs_runid_cgrid_idx       │ 7.74959e+06 │ 1595 MB    │ 537 MB
 rated_cdrs   │ setup_time_idx                   │ 7.74959e+06 │ 1595 MB    │ 166 MB

Any ideas to speed up this query?

Thanks

Best Answer

The first, and most important, thing you can do is get rid of that use of JSON fields. If you know in advance which field you'll be querying, preferably make it a real column.

At the cost of inefficiently widening the table you can do that with a generated column using a trigger, then index the generated column. It'll hurt your insert performance a bit, but you have gin indexes, so you clearly don't mind insert performance too much.
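On 9.4 (which has no native generated columns) a minimal sketch of that trigger-maintained column could look like the following; the column, function, trigger, and index names are all illustrative:

ALTER TABLE cdrs_extra ADD COLUMN connection text;

-- Keep the plain column in sync with the jsonb payload.
CREATE OR REPLACE FUNCTION cdrs_extra_set_connection() RETURNS trigger AS $$
BEGIN
    NEW.connection := NEW.extra_fields->>'connection';
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER cdrs_extra_connection_trg
BEFORE INSERT OR UPDATE ON cdrs_extra
FOR EACH ROW EXECUTE PROCEDURE cdrs_extra_set_connection();

-- Backfill existing rows once, then index the plain column.
UPDATE cdrs_extra SET connection = extra_fields->>'connection';
CREATE INDEX cdrs_extra_connection_plain_idx ON cdrs_extra (connection);

With that in place the filter becomes e.connection = '1', which a plain btree index can serve.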

An alternative is to create an expression index over the column, e.g.

CREATE INDEX cost_details_timespan_0_matcheddestid
ON cost_details((timespans->0->>'MatchedDestId'));

This is only useful if you always use exactly that expression in the query. To make it easier to match, it can be a good idea to create a wrapper SQL function, like cost_details_get_first_matcheddestid(jsonb), then use that in both the index and the query.
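A minimal sketch of such a wrapper (the function name is just illustrative; it must be IMMUTABLE so it can be used in an index):

CREATE OR REPLACE FUNCTION cost_details_get_first_matcheddestid(ts jsonb)
RETURNS text AS $$
    SELECT ts->0->>'MatchedDestId';
$$ LANGUAGE sql IMMUTABLE;

CREATE INDEX cost_details_first_matcheddestid_idx
ON cost_details (cost_details_get_first_matcheddestid(timespans));

-- ... and in the query:
-- AND cost_details_get_first_matcheddestid(c.timespans) IN ('UKN_FM1', 'UKN_FM2')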

Also look into partial indexes, for example, an index on

CREATE INDEX cdrs_extra_cgrid_when_conn1
ON cdrs_extra (cgrid)
WHERE (extra_fields->>'connection' = '1');

could be useful if you're always joining on the subset of cdrs_extra that has extra_fields->>'connection' = '1'.

Try to narrow the big tables as much as possible, moving unnecessary and heavily-duplicated columns out into secondary tables.
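For example (a sketch, assuming supplier is one of the heavily-duplicated columns; names are illustrative):

CREATE TABLE suppliers (
    id   SERIAL PRIMARY KEY,
    name VARCHAR(128) NOT NULL UNIQUE
);

ALTER TABLE rated_cdrs ADD COLUMN supplier_id INTEGER REFERENCES suppliers(id);

-- One-off migration: populate the lookup, point the existing rows at it,
-- then drop the wide varchar column.
INSERT INTO suppliers (name) SELECT DISTINCT supplier FROM rated_cdrs;
UPDATE rated_cdrs r SET supplier_id = s.id FROM suppliers s WHERE s.name = r.supplier;
-- ALTER TABLE rated_cdrs DROP COLUMN supplier;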

Finally, and especially if you can afford a delay before the data becomes available for analysis, consider batch-loading the data into a star schema, using fact tables. It sounds complicated and scary, but it's really not, and performance can be massively improved for some kinds of analysis work.
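As a rough sketch of that last idea (table and column names are illustrative), a nightly job could roll the detail rows up into a small daily fact table that your reports query instead:

CREATE TABLE fact_daily_calls (
    account   VARCHAR(128)  NOT NULL,
    direction VARCHAR(8)    NOT NULL,
    call_day  DATE          NOT NULL,
    calls     BIGINT        NOT NULL,
    cost      NUMERIC(20,4) NOT NULL,
    usage_sec NUMERIC(30,9) NOT NULL,
    PRIMARY KEY (account, direction, call_day)
);

-- Batch-load yesterday's calls.
INSERT INTO fact_daily_calls (account, direction, call_day, calls, cost, usage_sec)
SELECT p.account,
       p.direction,
       date_trunc('day', r.answer_time)::date,
       count(*),
       sum(r.cost),
       sum(r.usage)
FROM cdrs_primary p
JOIN rated_cdrs r ON r.cgrid = p.cgrid
WHERE r.answer_time >= date_trunc('day', now() - interval '1 day')
  AND r.answer_time <  date_trunc('day', now())
GROUP BY p.account, p.direction, date_trunc('day', r.answer_time);

Reports like the one above then become a trivial scan of the fact table.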