PostgreSQL – Slow performance on averaging query

optimization | performance | postgresql | query-performance

I am fairly new to DBMS and backend management, and was hoping to get some advice on the following issue.

We have a Postgres database with the following tables: members and activity_scores. Each activity_scores row belongs to a member, and new rows are written every 10 minutes. Here is an example of the rows of the activity_scores table (indexed on member_id):

 id | member_id | score |         created_at         |         updated_at         
----+-----------+-------+----------------------------+----------------------------
  1 |         8 |    73 | 2016-11-04 05:32:55.564235 | 2016-11-04 05:32:55.564235
  2 |        10 |    20 | 2016-11-04 05:22:55.564235 | 2016-11-04 05:22:55.564235

The activity_scores table can hold anywhere from tens of thousands to millions of rows per member.

The query we are trying to run essentially takes the scores for a member and averages them, grouped by date (in the member's time zone):

SELECT TO_CHAR((activity_scores.created_at AT TIME ZONE ? AT TIME ZONE members.zone),
         ?) AS date,
         AVG(score) AS average_score
FROM "activity_scores"
INNER JOIN "members"
    ON "members"."id" = "activity_scores"."member_id"
WHERE "activity_scores"."member_id" = $1
GROUP BY  date

An example result:

    date    |     average_score      
------------+------------------------
 2016/10/15 | 52.00000000000000000000
 2016/10/29 | 60.25000000000000000000
 2016/09/05 | 70.05000000000000000000

These values are then used for graphing.

Unfortunately, the above query takes a long time (sometimes up to a few minutes), and causes the entire server (hosted on Heroku) to time out.

There isn't really a need for us to keep the data in 10-minute increments for months. Therefore, I've thought about making a separate table that just stores the daily averages, which we could poll directly for graphing purposes. However, this is somewhat of a duplication of data (which is allegedly a no-no in database design).

Would you guys have any suggestions on how to handle something like this?

Thank you!

Best Answer

If you are only inserting into the table, old rows are never deleted or updated, and you don't insert rows for past dates, I see no harm at all in materializing the query results in another table. Each day, you could aggregate the previous day's results for each member. Index the new table properly and performance should be good.
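A minimal sketch of what that could look like. The table and column names (daily_average_scores, score_date) are illustrative, not from the original post, and this assumes created_at is stored in UTC (matching the AT TIME ZONE conversion in the original query) and PostgreSQL 9.5+ for ON CONFLICT:

```sql
-- Hypothetical summary table; the composite primary key also serves
-- as the index for per-member lookups.
CREATE TABLE daily_average_scores (
    member_id     integer NOT NULL REFERENCES members (id),
    score_date    date    NOT NULL,   -- the date in the member's time zone
    average_score numeric NOT NULL,
    PRIMARY KEY (member_id, score_date)
);

-- Run once a day (e.g. from cron or a scheduled worker) to aggregate
-- the previous day's raw rows for every member.
INSERT INTO daily_average_scores (member_id, score_date, average_score)
SELECT a.member_id,
       (a.created_at AT TIME ZONE 'UTC' AT TIME ZONE m.zone)::date AS score_date,
       AVG(a.score)
FROM activity_scores a
JOIN members m ON m.id = a.member_id
WHERE a.created_at >= date_trunc('day', now() AT TIME ZONE 'UTC') - interval '1 day'
  AND a.created_at <  date_trunc('day', now() AT TIME ZONE 'UTC')
GROUP BY a.member_id, score_date
ON CONFLICT (member_id, score_date)
    DO UPDATE SET average_score = EXCLUDED.average_score;
```

Note that the day boundary here is a UTC day; rows near midnight in a member's local time zone can land on the adjacent local date, so the exact cutoff logic may need adjusting to your requirements.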

This is not duplicated data; it is summary data. Since calculating the averages on the fly is not fast enough, you are justified in pre-calculating them and storing them in a summary table.
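The graphing query then becomes a simple indexed lookup against the summary table (again assuming the hypothetical daily_average_scores table sketched above):

```sql
-- Reads only the pre-aggregated rows for one member;
-- the (member_id, score_date) primary key makes this an index scan.
SELECT TO_CHAR(score_date, 'YYYY/MM/DD') AS date,
       average_score
FROM daily_average_scores
WHERE member_id = $1
ORDER BY score_date;
```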

Only if a member's members.zone changes would you have to recalculate the results for that member.
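That recalculation could be sketched as a delete-and-rebuild for the affected member, run in one transaction (same assumptions as above: a hypothetical daily_average_scores table and created_at stored in UTC):

```sql
BEGIN;

-- Drop the stale summaries for this member...
DELETE FROM daily_average_scores
WHERE member_id = $1;

-- ...and rebuild them from the raw rows under the new time zone.
INSERT INTO daily_average_scores (member_id, score_date, average_score)
SELECT a.member_id,
       (a.created_at AT TIME ZONE 'UTC' AT TIME ZONE m.zone)::date AS score_date,
       AVG(a.score)
FROM activity_scores a
JOIN members m ON m.id = a.member_id
WHERE a.member_id = $1
GROUP BY a.member_id, score_date;

COMMIT;
```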