PostgreSQL – Using NTILE() Partition Function

aggregategroup bypostgresqlwindow functions

i have a point grid with population values. each point has an id and population value. i also have an state_id which say in what state the point is.

now i want to calculate the percentile ntile(100) for each state.

   SELECT  id, population, state_id, 
   ntile(100) OVER(ORDER BY car20) as percentile
   FROM avi_threshold01

when i use this, i think it calculates the ntile over all points and states.

Best Answer

If you want percentiles per state, then use PARTITION BY state_id in the OVER clause.

And that GROUP BY looks spurious at least. I think it needs to be removed if you want percentiles. And group by the PK is a no-operation anyway.

   SELECT  
       id, population, state_id, 
       ntile(100) OVER (PARTITION BY state_id ORDER BY car20) AS percentile
   FROM 
       avi_threshold01 ;

Related Solutions

postgresql,performance,postgresql-performance – Scaling ntile(N) Over a Set Where M < N

Starting with PostgreSQL 9.4 there's a Standard SQL percentile_disc aggregate function:

percentile_cont(0.25) WITHIN GROUP (ORDER BY column1), 
percentile_cont(0.5 ) WITHIN GROUP (ORDER BY column1), 
percentile_cont(0.75) WITHIN GROUP (ORDER BY column1),

If you can't upgrade on your Heroku instance you can rewrite it. I did that a few years ago on Teradata, besides the proprietary QUALIFY it's similar syntax:

According to SQL:2008 PERCENTILE_DISC(x) is the first value with a CUME_DIST greater than or equal to x. This directly translates to the row with a

ROW_NUMBER() OVER (PARTITION BY groupcol ORDER BY ordercol)
= CEILING(COUNT(*) OVER (PARTITION BY groupcol) * x 


with 
  v as (
    select * from (values 
      (1),(2),(3),(4),(5),(6),(7),(8),(9),
      (10),(11),(12),(13),(14),(15),(16),(17),(18),(19),
      (20),(21),(22),(23),(24),(25),(26),(27),(28),(29),
      (30),(31),(32),(33),(34),(35),(36),(37),(38),(39),
      (40),(41),(42),(43),(44),(45),(46),(47),(48),(49),
      (50),(51),(52),(53),(54),(55),(56),(57),(58),(59),
      (60),(61),(62),(63),(64),(65),(66),(67),(68),(69),
      (70),(71),(72),(73),(74),(75),(76),(77),(78),(79),
      (80),(81),(82),(83),(84),(85),(86),(87),(88),(89)
    ) as v 
  ),
  p as ( 
    select 
      column1, 
      row_number() OVER (ORDER BY column1 ASC) as rn,
      count(*) OVER () as cnt
    from v
  )
 select * 
 from p
 where rn in (CEILING(cnt * 0.25)
             ,CEILING(cnt * 0.5)
             ,CEILING(cnt * 0.75))

See Fiddle

PostgreSQL 9.3.5 Percentile Rank – How to Use Sorted Argument

In Postgres 9.4+, you can use the hypothetical-set aggregate function percent_rank() in a LATERAL join like this:

SELECT *
FROM   daily_data d, LATERAL (
   SELECT round(percent_rank(d.daily_val) WITHIN GROUP (ORDER BY daily_val)::numeric
              , 6) AS pctl_calc
   FROM   daily_data
   WHERE  company_id = d.company_id
   AND    trade_date < d.trade_date
   ) x
ORDER  BY company_id, trade_date;

The CROSS JOIN LATERAL (, LATERAL for short) is safe, because a subquery with an aggregate function always returns a row. (Else you might want LEFT JOIN LATERAL (...) x ON true.)

Minor difference: The first row in each group returns 0, not 1 like in your example. Easily fixed, if you need it.

This would do the same job in Postgres 9.3:

SELECT d.*
     , round(CASE WHEN x.ct = 0 THEN numeric '1'  -- or 0 ?
                  ELSE x.ct_lt / x.ct END, 6) AS pctl_calc
FROM   daily_data d, LATERAL (
   SELECT count(daily_val) AS ct
        , count(daily_val < d.daily_val OR NULL)::numeric As ct_lt
   FROM   daily_data
   WHERE  company_id = d.company_id
   AND    trade_date < d.trade_date
   ) x
ORDER  BY company_id, trade_date;

I use count(daily_val) for the full count, not count(*) to ignore NULL values - necessary if daily_val can be NULL, .

SQL Fiddle.

Or, without LATERAL join for even older versions. May be faster, too:

SELECT company_id, trade_date, daily_val, pctl
     , round(CASE WHEN ct = 0 THEN numeric '1'  -- 0 ?
                  ELSE ct_lt / ct END, 6) AS pctl_calc
FROM  (
   SELECT d.*
        , count(x.daily_val) AS ct
        , count(x.daily_val < d.daily_val OR NULL)::numeric AS ct_lt
   FROM   daily_data d
   LEFT   JOIN daily_data x ON x.company_id = d.company_id
                           AND x.trade_date < d.trade_date
   GROUP  BY d.id  -- pg 9.1+
   ) sub
ORDER  BY company_id, trade_date;

Assuming id is the PK for a simplified GROUP BY, requires pg 9.1 though.

Best Answer

Related Solutions

postgresql,performance,postgresql-performance – Scaling ntile(N) Over a Set Where M < N

PostgreSQL 9.3.5 Percentile Rank – How to Use Sorted Argument

Related Question