PostgreSQL – Removing Outliers When Grouping by Minute

postgresql · query

I have a very simple table storing the following fields per HTTP request:

  • id serial
  • url varchar
  • date varchar (yyyy-mm-dd)
  • time varchar (hh:mm)
  • latency integer
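
A minimal sketch of the table as described (column names mdate/mtime taken from the query below; the latency unit isn't stated in the post):

create table mtable (
    id      serial primary key,
    url     varchar,
    mdate   varchar,  -- 'yyyy-mm-dd'
    mtime   varchar,  -- 'hh:mm'
    latency integer   -- unit not stated in the post
);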

And I am trying to calculate the average latency per minute. However, if I have 10 requests (nine of 1 s and one of 100 s), the average becomes useless: (9 × 1 + 100) / 10 = 10.9 s, which reflects nobody's actual experience. In order to solve that, I was wondering about removing outliers.

So, basically what I want is to calculate the average excluding requests whose latency is above the 99.7% threshold (avg + 3 × stddev, per the three-sigma rule). Naturally, I want to group those requests per minute in order to follow the "average" latency per minute.

My query is as follows:

select m.mdate, m.mtime, avg(m.latency)
from mtable m
  join (
    select mdate, mtime, avg(latency) + 3 * stddev(latency) as uband
    from mtable
    where url like '%ThePartialUrl%'
    group by mdate, mtime
  ) as t
    on m.mdate = t.mdate
   and m.mtime = t.mtime
   and m.latency < t.uband
where url like '%ThePartialUrl%'
group by m.mdate, m.mtime
order by m.mdate, m.mtime

However, it is taking too much time.
Things I already know/have done:

  • a url like '%…%' with a leading wildcard cannot use a B-tree index, so it forces a full scan (though a trigram index can serve such patterns; see the sketch after this list)
  • mdate and mtime could be one column using a timestamp value
  • I have indexes on almost everything (mdate, mtime, latency)
  • adjusting the work_mem parameter
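
For the leading-wildcard match, a trigram index is one option; a hedged sketch, assuming the pg_trgm extension is available (not something tried in the original setup):

-- GIN trigram index: lets the planner serve like '%...%' from an index
create extension if not exists pg_trgm;
create index mtable_url_trgm_idx on mtable using gin (url gin_trgm_ops);

With ~10M rows matching the pattern, the index may or may not pay off, but it removes the "full scan regardless of index" constraint.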

The table itself has around 100M rows in total.

Looking at the execution plan, it seems the nested loop (join) is what takes most of the time, since the subquery returns 40K rows (one per minute over a whole month) and has to be joined against around 10M rows matching '%ThePartialUrl%', resulting in billions of loop iterations.
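
For reference, the same avg + 3 × stddev filter can be written as a single scan with window functions, avoiding the join entirely; a sketch (untested on this dataset):

select mdate, mtime, avg(latency)
from (
  select mdate, mtime, latency,
         -- per-minute mean and stddev computed in the same pass
         avg(latency)    over (partition by mdate, mtime) as mean_l,
         stddev(latency) over (partition by mdate, mtime) as sd_l
  from mtable
  where url like '%ThePartialUrl%'
) sub
where latency < mean_l + 3 * sd_l
group by mdate, mtime
order by mdate, mtime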

Anything that could help me improve it?

I am using Postgres 10.

Thanks and Regards

Best Answer

It seems there is no way to solve the problem without a subquery, which is a performance constraint in my case, so I decided to move to another approach.

I gave up on the trimmed mean (thanks @Linas) and moved to a percentile_disc indicator. Since all I wanted was to give a measure of latency for each request, I used a simple query as follows:

select
  mdate,
  mtime,
  percentile_disc(0.997) within group (order by latency) as "response"
from mtable
where url like '%MyPartialUrl%'
group by mdate, mtime
order by mdate, mtime

This way I get the largest latency it took to answer a request for 99.7% of all users. In fact, I am grabbing several percentiles following the six-sigma approach, to show how much slowness per million requests we are getting in production.
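
Several percentiles can be fetched in one pass by handing percentile_disc an array of fractions; a sketch (the fractions here are illustrative, not the exact ones used):

select
  mdate,
  mtime,
  -- returns an array of latencies, one per requested fraction
  percentile_disc(array[0.5, 0.9, 0.997, 0.9999966])
    within group (order by latency) as "responses"
from mtable
where url like '%MyPartialUrl%'
group by mdate, mtime
order by mdate, mtime

The last fraction corresponds to the classic six-sigma target of roughly 3.4 defects per million.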

The query took no more than 8 seconds to execute; with the trimmed means before, it was around 90 minutes.

Thanks all. Regards