Optimizing PostgreSQL Query with Large Quantity of Squid Access Requests

Tags: index, optimization, postgresql

Hello people, I'm using a log daemon (https://github.com/paranormal/blooper) with Squid Proxy to put the access log into PostgreSQL, and I wrote a trigger function:

-- Wrapper added for completeness; the original post shows only the body,
-- so the function name here is illustrative.
CREATE OR REPLACE FUNCTION accesses_partition_insert() RETURNS trigger AS $$
DECLARE
  -- month and year of the access, used to name the partition
  newtime varchar := EXTRACT (MONTH FROM NEW."time")::varchar;
  newyear varchar := EXTRACT (YEAR FROM NEW."time")::varchar;
  -- dots are not valid in an unquoted schema name, so replace them
  user_name varchar := REPLACE (NEW.user_name, '.', '_');
  partname varchar := newtime || '_' ||  newyear;
  tablename varchar := user_name || '.accesses_' || partname;
BEGIN

  IF NEW.user_name IS NOT NULL THEN
    -- one schema per user
    EXECUTE 'CREATE SCHEMA IF NOT EXISTS ' || user_name;

    -- one partition per user and per month-year, inheriting from the master table
    EXECUTE 'CREATE TABLE IF NOT EXISTS '
    || tablename
    || '('
    || 'CHECK (user_name = ''' || NEW.user_name || ''' AND EXTRACT(MONTH FROM "time") = ' || newtime || ' AND EXTRACT (YEAR FROM "time") = ' || newyear || ')'
    || ') INHERITS (public.accesses)';

    EXECUTE 'CREATE INDEX IF NOT EXISTS access_index_' || partname || '_user_name ON ' || tablename || ' (user_name)';
    EXECUTE 'CREATE INDEX IF NOT EXISTS access_index_' || partname || '_time ON ' || tablename || ' ("time")';

    -- route the row into its partition instead of the master table
    EXECUTE 'INSERT INTO ' || tablename || ' SELECT $1.*' USING NEW;
  END IF;

  -- returning NULL suppresses the insert into the master table
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;
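
Presumably the function is attached to the master table as a row-level BEFORE INSERT trigger (the post doesn't show this part, so the trigger name below is assumed):

CREATE TRIGGER accesses_partition_insert
  BEFORE INSERT ON public.accesses
  FOR EACH ROW EXECUTE PROCEDURE accesses_partition_insert();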

Its main purpose is to create a table partition per user_name and per month-year of the access, inheriting from a clean master table:

CREATE TABLE public.accesses
(
  id integer NOT NULL DEFAULT nextval('accesses_id_seq'::regclass),
  "time" timestamp with time zone NOT NULL,
  time_response integer,
  mac_source macaddr,
  ip_source inet NOT NULL,
  ip_destination inet,
  user_name character varying(40),
  http_status_code numeric(3,0) NOT NULL,
  http_reply_size bigint NOT NULL,
  http_request_method character varying(15) NOT NULL,
  http_request_url character varying(4166) NOT NULL,
  http_content_type character varying(100),
  squid_hier_code character varying(20),
  squid_request_status character varying(50),
  user_id integer,
  CONSTRAINT accesses_http_request_method_fkey FOREIGN KEY (http_request_method)
  REFERENCES public.http_requests (method) MATCH SIMPLE
  ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT accesses_http_status_code_fkey FOREIGN KEY (http_status_code)
  REFERENCES public.http_statuses (code) MATCH SIMPLE
  ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT accesses_user_id_fkey FOREIGN KEY (user_id)
  REFERENCES public.users (id) MATCH SIMPLE
  ON UPDATE NO ACTION ON DELETE NO ACTION
)

The main problem is getting the sum of http_reply_size grouped by user_name and time; my query is:

SELECT
  "time",
  user_name,
  sum(http_reply_size)
FROM
  accesses
WHERE
  extract(epoch from "time") BETWEEN 1516975122 AND 1516996722
GROUP BY
  "time",
  user_name

But this query is very slow on the server (3,237,976 rows currently, in only 2 days). Does PostgreSQL have something to optimize a query like this, or do I need to use another SQL or NoSQL system?
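
Checking the plan with EXPLAIN (ANALYZE, BUFFERS), standard PostgreSQL tooling not shown in the original post, reveals whether the partitions and indexes are being used at all:

EXPLAIN (ANALYZE, BUFFERS)
SELECT "time", user_name, sum(http_reply_size)
FROM accesses
WHERE extract(epoch from "time") BETWEEN 1516975122 AND 1516996722
GROUP BY "time", user_name;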

Best Answer

You are representing time in 3 different ways. The query uses epoch, the index uses timestamptz, and the partition constraints use year and month as two separate fields. So the query can make use of neither the index nor the partition constraints.

You should probably change them all to use timestamptz. In the query, instead of converting the column "time" from timestamptz to epoch, convert the epoch values to timestamptz for the BETWEEN constants. (Or just have the client send timestamptz rather than epoch in the first place.)
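
A minimal rewrite along those lines, keeping the same epoch constants and converting them with to_timestamp(), would be:

SELECT
  "time",
  user_name,
  sum(http_reply_size)
FROM
  accesses
WHERE
  "time" BETWEEN to_timestamp(1516975122) AND to_timestamp(1516996722)
GROUP BY
  "time",
  user_name;

With the predicate on the bare column, the per-partition index on "time" becomes usable.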

For the check constraint, you could use date_trunc('month', NEW."time") and date_trunc('month', NEW."time" + interval '1 month') to arrive at the endpoints to put into the check constraint. You would want to spell out the check constraint as something like "time" >= low_limit AND "time" < high_limit rather than using BETWEEN.
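
A sketch of how the trigger could build that constraint, assuming two extra timestamptz variables (low_limit, high_limit) declared next to the existing ones, and using quote_literal to keep the quoting straight; with constraints in this form, constraint exclusion can prune partitions for a plain timestamptz range predicate:

-- low_limit / high_limit are assumed new timestamptz declarations
low_limit  := date_trunc('month', NEW."time");
high_limit := date_trunc('month', NEW."time" + interval '1 month');

EXECUTE 'CREATE TABLE IF NOT EXISTS ' || tablename
  || ' (CHECK (user_name = ' || quote_literal(NEW.user_name)
  || ' AND "time" >= ' || quote_literal(low_limit)
  || ' AND "time" < '  || quote_literal(high_limit)
  || ')) INHERITS (public.accesses)';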