Optimizing PostgreSQL Query with Large Quantity of Squid Access Requests

Tags: index, optimization, postgresql

Hello people, I'm using a log daemon (https://github.com/paranormal/blooper) with Squid Proxy to put the access log into PostgreSQL, and I wrote a trigger function:

-- Wrapper added for completeness; the original post shows only the body,
-- so the function name here is illustrative.
CREATE OR REPLACE FUNCTION accesses_partition_insert() RETURNS trigger AS $$
DECLARE
  -- month and year of the access, used to name the partition
  newtime varchar := EXTRACT (MONTH FROM NEW."time")::varchar;
  newyear varchar := EXTRACT (YEAR FROM NEW."time")::varchar;
  -- dots are not valid in an unquoted schema name, so replace them
  user_name varchar := REPLACE (NEW.user_name, '.', '_');
  partname varchar := newtime || '_' ||  newyear;
  tablename varchar := user_name || '.accesses_' || partname;
BEGIN

  IF NEW.user_name IS NOT NULL THEN
    -- one schema per user
    EXECUTE 'CREATE SCHEMA IF NOT EXISTS ' || user_name;

    -- one partition per user and per month-year, inheriting from the master table
    EXECUTE 'CREATE TABLE IF NOT EXISTS '
    || tablename
    || '('
    || 'CHECK (user_name = ''' || NEW.user_name || ''' AND EXTRACT(MONTH FROM "time") = ' || newtime || ' AND EXTRACT (YEAR FROM "time") = ' || newyear || ')'
    || ') INHERITS (public.accesses)';

    EXECUTE 'CREATE INDEX IF NOT EXISTS access_index_' || partname || '_user_name ON ' || tablename || ' (user_name)';
    EXECUTE 'CREATE INDEX IF NOT EXISTS access_index_' || partname || '_time ON ' || tablename || ' ("time")';

    -- route the row into its partition instead of the master table
    EXECUTE 'INSERT INTO ' || tablename || ' SELECT $1.*' USING NEW;
  END IF;

  -- returning NULL suppresses the insert into the master table
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;
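
Presumably the function is attached to the master table as a row-level BEFORE INSERT trigger (the post doesn't show this part, so the trigger name below is assumed):

CREATE TRIGGER accesses_partition_insert
  BEFORE INSERT ON public.accesses
  FOR EACH ROW EXECUTE PROCEDURE accesses_partition_insert();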

Its main purpose is to create a table partition per user_name and per month-year of the access, inheriting from a clean master table:

CREATE TABLE public.accesses
(
  id integer NOT NULL DEFAULT nextval('accesses_id_seq'::regclass),
  "time" timestamp with time zone NOT NULL,
  time_response integer,
  mac_source macaddr,
  ip_source inet NOT NULL,
  ip_destination inet,
  user_name character varying(40),
  http_status_code numeric(3,0) NOT NULL,
  http_reply_size bigint NOT NULL,
  http_request_method character varying(15) NOT NULL,
  http_request_url character varying(4166) NOT NULL,
  http_content_type character varying(100),
  squid_hier_code character varying(20),
  squid_request_status character varying(50),
  user_id integer,
  CONSTRAINT accesses_http_request_method_fkey FOREIGN KEY (http_request_method)
  REFERENCES public.http_requests (method) MATCH SIMPLE
  ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT accesses_http_status_code_fkey FOREIGN KEY (http_status_code)
  REFERENCES public.http_statuses (code) MATCH SIMPLE
  ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT accesses_user_id_fkey FOREIGN KEY (user_id)
  REFERENCES public.users (id) MATCH SIMPLE
  ON UPDATE NO ACTION ON DELETE NO ACTION
)

The main problem is getting the sum of http_reply_size grouped by user_name and time; my query is:

SELECT
  "time",
  user_name,
  sum(http_reply_size)
FROM
  accesses
WHERE
  extract(epoch from "time") BETWEEN 1516975122 AND 1516996722
GROUP BY
  "time",
  user_name

But this query is very slow on the server (3,237,976 rows currently, in only 2 days). Does PostgreSQL have something to optimize a query like this, or do I need to use another SQL or NoSQL system?
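
Checking the plan with EXPLAIN (ANALYZE, BUFFERS), standard PostgreSQL tooling not shown in the original post, reveals whether the partitions and indexes are being used at all:

EXPLAIN (ANALYZE, BUFFERS)
SELECT "time", user_name, sum(http_reply_size)
FROM accesses
WHERE extract(epoch from "time") BETWEEN 1516975122 AND 1516996722
GROUP BY "time", user_name;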

Best Answer

You are representing time in 3 different ways. The query uses epoch, the index uses timestamptz, and the partition constraints use year and month as two separate fields. So the query can make use of neither the index nor the partition constraints.

You should probably change them all to use timestamptz. In the query, instead of converting the column "time" from timestamptz to epoch, convert the epoch values to timestamptz for the BETWEEN constants. (Or just have the client send timestamptz rather than epoch in the first place.)
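
A minimal rewrite along those lines, keeping the same epoch constants and converting them with to_timestamp(), would be:

SELECT
  "time",
  user_name,
  sum(http_reply_size)
FROM
  accesses
WHERE
  "time" BETWEEN to_timestamp(1516975122) AND to_timestamp(1516996722)
GROUP BY
  "time",
  user_name;

With the predicate on the bare column, the per-partition index on "time" becomes usable.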

For the check constraint, you could use date_trunc('month', NEW."time") and date_trunc('month', NEW."time" + interval '1 month') to arrive at the endpoints to put into the check constraint. You would want to spell out the check constraint as something like "time" >= low_limit AND "time" < high_limit rather than using BETWEEN.
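
A sketch of how the trigger could build that constraint, assuming two extra timestamptz variables (low_limit, high_limit) declared next to the existing ones, and using quote_literal to keep the quoting straight; with constraints in this form, constraint exclusion can prune partitions for a plain timestamptz range predicate:

-- low_limit / high_limit are assumed new timestamptz declarations
low_limit  := date_trunc('month', NEW."time");
high_limit := date_trunc('month', NEW."time" + interval '1 month');

EXECUTE 'CREATE TABLE IF NOT EXISTS ' || tablename
  || ' (CHECK (user_name = ' || quote_literal(NEW.user_name)
  || ' AND "time" >= ' || quote_literal(low_limit)
  || ' AND "time" < '  || quote_literal(high_limit)
  || ')) INHERITS (public.accesses)';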