Postgresql – Slow query / Indexes creation (PostgreSQL 9.2)

index, optimization, performance, postgresql, query-performance

I have the following Query:

explain analyze
SELECT split_part(full_path, '/', 4)::INT AS account_id,
       split_part(full_path, '/', 6)::INT AS note_id,
       split_part(full_path, '/', 9)::TEXT AS variation,
       st_size,
       segment_index,
       reverse(split_part(reverse(full_path), '/', 1)) as file_name,
       i.st_ino,
       full_path,
       (i.st_size / 1000000::FLOAT)::NUMERIC(5,2) || 'MB' AS size_mb
FROM gorfs.inodes i
JOIN gorfs.inode_segments s
  ON i.st_ino = s.st_ino_target
WHERE
      i.checksum_md5 IS NOT NULL
  AND s.full_path ~ '^/userfiles/account/[0-9]+/[a-z]+/[0-9]+'
  AND i.st_size > 0
  AND split_part(s.full_path, '/', 4)::INT IN (
        SELECT account.id
        FROM public.ja_clients AS account
        WHERE
        NOT (
                ((account.last_sub_pay > EXTRACT('epoch' FROM (transaction_timestamp() - CAST('4 Months' AS INTERVAL)))) AND (account.price_model > 0)) OR
                (account.regdate > EXTRACT('epoch' FROM (transaction_timestamp() - CAST('3 Month' AS INTERVAL)))) OR
                (((account.price_model = 0) AND (account.jobcredits > 0)) AND (account.last_login > EXTRACT('epoch' FROM (transaction_timestamp() - CAST('4 Month' AS INTERVAL)))))
        ) LIMIT 100
);

The query takes ages to complete, and I haven't been able to get the problem solved.

These are the indexes I've already created on the inode_segments table:

Indexes:
    "ix_account_id_from_full_path" "btree" (("split_part"("full_path"::"text", '/'::"text", 4)::integer)) WHERE "full_path"::"text" ~ '^/userfiles/account/[0-9]+/[a-z]+/[0-9]+'::"text"
    "ix_inode_segments_ja_files_lookup" "btree" ((
CASE
    WHEN "full_path"::"text" ~ '/[^/]*\.[^/]*$'::"text" THEN "upper"("regexp_replace"("full_path"::"text", '.*\.'::"text", ''::"text", 'g'::"text"))
    ELSE NULL::"text"
END)) WHERE "gorfs"."is_kaminski_note_path"("full_path"::"text")
    "ix_inode_segments_notes_clientids" "btree" (("split_part"("full_path"::"text", '/'::"text", 4)::integer)) WHERE "gorfs"."is_kaminski_note_path"("full_path"::"text")
    "ix_inode_segments_notes_clientids2" "btree" ("full_path")
    "ix_inode_segments_notes_fileids" "btree" (("split_part"("full_path"::"text", '/'::"text", 8)::integer)) WHERE "gorfs"."is_kaminski_note_path"("full_path"::"text")
    "ix_inode_segments_notes_noteids" "btree" ((NULLIF("split_part"("full_path"::"text", '/'::"text", 6), 'unassigned'::"text")::integer)) WHERE "gorfs"."is_kaminski_note_path"("full_path"::"text")

These are the indexes I've already created on the inodes table:

 Indexes:
    "ix_inodes_checksum_st_size" "btree" ("checksum_md5", "st_size") WHERE "checksum_md5" IS NOT NULL

Question:

What else can I do to improve the performance of this query?

UPDATE 1:

Explain analyze: http://explain.depesz.com/s/UBr

The index and function have been created as suggested in the answer below.

UPDATE 2:

Explain analyze: http://explain.depesz.com/s/LHS

Using the query provided in the answer below.

Best Answer

Perhaps this will help.

If you will often rely on the account_id extracted from full_path, you'll benefit from a function and a matching functional index:

CREATE OR REPLACE FUNCTION gorfs.f_get_account_from_full_path(p_full_path text) RETURNS int AS $body$
SELECT (regexp_matches($1, '^/userfiles/account/([0-9]+)/[a-z]+/[0-9]+'))[1]::int
$body$ LANGUAGE SQL IMMUTABLE SECURITY DEFINER RETURNS NULL ON NULL INPUT;

CREATE INDEX ON gorfs.inode_segments (gorfs.f_get_account_from_full_path(full_path));
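As a quick sanity check, you can verify that the planner picks up the new functional index by filtering on the exact same expression (the account id 12345 here is an arbitrary placeholder):

```sql
-- The WHERE predicate must match the indexed expression exactly
EXPLAIN
SELECT st_ino_target
FROM gorfs.inode_segments
WHERE gorfs.f_get_account_from_full_path(full_path) = 12345;
-- The plan should show an Index Scan or Bitmap Index Scan
-- on the functional index instead of a Seq Scan
```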

Make sure gorfs.inodes has an index on st_ino (or, much better, make it a key if applicable)!
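If st_ino uniquely identifies each row, a primary key is the ideal option; a sketch, assuming the column contains no duplicates or NULLs:

```sql
-- Assumes st_ino is unique and NOT NULL in gorfs.inodes
ALTER TABLE gorfs.inodes ADD PRIMARY KEY (st_ino);

-- Otherwise, a plain index still helps the join:
-- CREATE INDEX ON gorfs.inodes (st_ino);
```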

You call the function split_part several times for each row, which is likely taking a significant toll. I've replaced it with a single string_to_array call and then fetch the individual pieces as needed. I also didn't understand what you intended to obtain for file_name with the reverse trick; the query below simply returns the last path element for it.
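To illustrate the substitution with a made-up path (the literal below is just an example, not real data): one string_to_array call yields an array from which every piece can be read by index, where repeated split_part calls would each re-scan the string.

```sql
-- One array split replaces several split_part calls per row.
-- Note: a leading '/' produces an empty first element, so indexes
-- line up with split_part's 1-based field numbering.
SELECT parts[4] AS account_id,                     -- '42'
       parts[6] AS note_id,                        -- '7'
       parts[array_upper(parts, 1)] AS file_name   -- 'file.txt'
FROM (
  SELECT string_to_array('/userfiles/account/42/notes/7/a/b/file.txt', '/') AS parts
) t;
```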

Your query returns many millions of rows. Even if PostgreSQL processes the query reasonably quickly, your client application (especially if you use PgAdminIII) will struggle to allocate enough memory to receive and format the results, and that will probably be what takes the most time. So you may want to create a temporary table with the results, and then query against the temporary table:

CREATE TEMP TABLE myresults AS
WITH
  accounts AS (
    SELECT id
    FROM public.ja_clients
    WHERE NOT (
               (last_sub_pay > EXTRACT('epoch' FROM now() - '4 Months'::INTERVAL) AND price_model > 0) OR
               regdate > EXTRACT('epoch' FROM now() - '3 Month'::INTERVAL) OR
               (price_model = 0 AND jobcredits > 0 AND last_login > EXTRACT('epoch' FROM now() - '4 Month'::INTERVAL))
              )
    ORDER BY 1 LIMIT 100 -- first 100 accounts for testing purposes; comment out this line once the query is proven performant enough
    ) 
SELECT r.parts[4]::INT AS account_id, r.parts[6]::INT AS note_id, r.parts[9] AS variation,
       st_size, segment_index, r.parts[array_upper(r.parts, 1)] AS file_name, st_ino, full_path, size_mb
FROM (
  SELECT string_to_array(full_path, '/') AS parts, st_size, segment_index, i.st_ino, full_path,
         (i.st_size / 1000000::FLOAT)::NUMERIC(5,2) || 'MB' AS size_mb
  FROM gorfs.inode_segments s
  JOIN gorfs.inodes i ON (i.st_ino = s.st_ino_target)
  WHERE gorfs.f_get_account_from_full_path(s.full_path) IN (SELECT * FROM accounts)
    AND i.checksum_md5 IS NOT NULL
    AND i.st_size > 0
  ) r;

SELECT *
FROM myresults
LIMIT 100;