Postgresql – Slow postgres query that uses quick sort instead of hash-aggregate

group byoptimizationperformancepostgresqlquery-performance

I'm on postgres 9.5 and working with a web-analytics database that records visitor traffic.

I'm trying to optimize a very slow query gives me a count of the unique people that visited a given page grouped by their session device type.

The query looks like so:

select   groupname,  count(person_id) as thecount
from (
  select distinct S.first_device_type as groupname, A.person_id
  from event_page as O
  join alias as A
    on (O.person_alias = A.alias)
  left outer join session as S
    on (O.session_id = S.session_id) 
  join alias as A1
    on ( A1.alias = S.person_alias and A.person_id = A1.person_id)
  where O.timestamp_ between timestamp '2017-02-11 23:22:20.146' and timestamp '2017-03-13 23:22:20.146'
    and O.location_host = 'www.foo.bar.ca'
    and S.first_seen between timestamp '2017-02-11 23:22:20.146' and timestamp '2017-03-13 23:22:20.146'
) as alias_120134400
group by groupname

The above SQL runs in over 2.5 minutes and the output looks like this.

device type        count
--------------------------
Computer       |  163304
Game console   |      41
Mobile         |   33519
Tablet         |   10465
Unknown        |       5

There are couple peculiarities about the schema and above query I need to point out:

  • The event_page table is a union of monthly tables
  • Joining the alias table twice is necessary because an alias may change during a session. aliases to personIDs is many to one. Hence the page view alias and session alias may be different but point to the same personID.
  • All statistics are up to date (everything vacuum analyzed)!

As the explain below shows it appears that the Quick Sort is the bulk of my problem:

https://explain.depesz.com/s/6JjQ

In addition to that there is seq scan occurring (line 23 in the explain) on the session table yet that condition should be easily handled by the index.

I've run the same query on a different (albeit smaller) dastaset and the query plan seems more sensible. This one executes in only 3 seconds. Here is the explain:

https://explain.depesz.com/s/4k8t

Any ideas as to why the former explain chooses quick sort and the latter uses HashAaggregate to achieve the DISTINCT?

What can I do optimize the query and avoid quick sort?

Why (in both cases) is the planner choosing a seq scan of session when that filter is covered by an index?

Thanks in advance for any help.

Best Answer

You can see the flaw here,

Hash Cond: (((event_page_2016_8.person_alias)::text = (a.alias)::text) AND ((event_page_2016_8.session_id)::text = (s.session_id)::text))

It thinks that condition returns 2 rows. Instead it returns 11,861,365. If it returned 2 rows, a quick sort almost certainly would have been faster. The question is primarily why does it think that it's so far off there..

You may want to try to refresh the statistics (or even raising the statistics target for those tables) with ANALYZE event_page_2016_8 alias session, or see what could be returning so many rows to begin with.

Also this conditional doesn't make much sense to me

join alias as A
  on (O.person_alias = A.alias)
left outer join session as S
  on (O.session_id = S.session_id) 
join alias as A1
  on ( A1.alias = S.person_alias and A.person_id = A1.person_id)

Maybe I'm missing something. So you're self-joining A to A1 on person_id but also S.person_alias which has to share a session_id.. Why not just

join alias as A
  on (O.person_alias = A.alias)
join session as S
  USING (session_id)

And leave out A1 entirely? You're not using it in the SELECT list?