PostgreSQL Query Optimization – A Better Way to Write This Query?

performance, postgresql, postgresql-performance, relational-division, subquery

I have the following schema for my database (PostgreSQL 10):

CREATE TABLE "PulledTexts" (
"Id" serial PRIMARY KEY,
"BaseText" TEXT,
"CleanText" TEXT
);

CREATE TABLE "UniqueWords" (
"Id" serial PRIMARY KEY,
"WordText" TEXT
);

CREATE TABLE "WordTexts" (
"Id" serial PRIMARY KEY,
"TextIdId" INTEGER REFERENCES "PulledTexts",
"WordIdId" INTEGER REFERENCES "UniqueWords"
);
CREATE INDEX "IX_WordTexts_TextIdId" ON "WordTexts" ("TextIdId");
CREATE INDEX "IX_WordTexts_WordIdId" ON "WordTexts" ("WordIdId");

Some sample data:

INSERT INTO public."PulledTexts" ("Id", "BaseText", "CleanText") VALUES
(1, 'automate business audit', null),
(2, 'audit trial', null),
(3, 'trial', null),
(4, 'audit', null),
(5, 'fresh report', null),
(6, 'fresh audit', null),
(7, 'automate this script', null),
(8, 'im trying here', null),
(9, 'automate this business', null),
(10, 'lateral', null);

INSERT INTO public."UniqueWords" ("Id", "WordText") VALUES
(1, 'trial'),
(2, 'audit'),
(3, 'creation'),
(4, 'business'),
(5, 'automate');

INSERT INTO public."WordTexts" ("Id", "TextIdId", "WordIdId") VALUES
(1, 1, 2),
(2, 1, 4),
(3, 1, 5),
(4, 2, 1),
(5, 3, 1),
(6, 4, 2),
(7, 6, 2),
(8, 7, 5),
(9, 9, 4),
(10, 9, 5),
(11, 2, 2);

The database itself is created through Entity Framework migrations for now.

I would like to know if there is a better, specifically more efficient, way to write this query, because the WordTexts table will contain hundreds of thousands of records and eventually millions. I am also open to going the NoSQL route if that would be more efficient for these kinds of queries.

SELECT *
FROM "PulledTexts"
WHERE "Id" IN (
 SELECT "TextIdId"
 FROM "WordTexts" AS "wordTexts"
 LEFT JOIN "UniqueWords" AS "wordTexts.WordId" ON "wordTexts"."WordIdId" = "wordTexts.WordId"."Id"
 WHERE "wordTexts.WordId"."WordText" = 'automate'

 OR "TextIdId" IN (
  SELECT "TextIdId" and1
  from "WordTexts" AS "wordTexts"
  LEFT JOIN "UniqueWords" AS "wordTexts.WordId" ON "wordTexts"."WordIdId" = "wordTexts.WordId"."Id"
  where "wordTexts.WordId"."WordText" = 'audit' INTERSECT

  SELECT "TextIdId" and2
  from "WordTexts" AS "wordTexts"
  LEFT JOIN "UniqueWords" AS "wordTexts.WordId" ON "wordTexts"."WordIdId" = "wordTexts.WordId"."Id"
  WHERE "wordTexts.WordId"."WordText" = 'trial'
 )
);

At a high level, this query is supposed to return the PulledTexts rows whose IDs are returned by the subqueries. The subqueries are meant to return the IDs of PulledTexts that contain, let's say, ('audit' AND 'trial') OR 'automate' from the UniqueWords table. This is exactly what the example query does. The WordTexts table is a simple mapping of UniqueWords to PulledTexts.

Best Answer

While your query is valid, there is a lot I would do differently.

- Don't use CaMeL-case names in Postgres if it can be avoided. Your unnamed entity framework may force this nonsense upon you, but I don't want to deal with the double-quote mess, so I tested with your schema after removing all double-quotes - effectively making all identifiers lower-cased.
  See: Are PostgreSQL column names case-sensitive?
- Don't use illegible or illegal column and table aliases (like "wordTexts.WordId"). That's a matter of taste and style (and sanity), but you also omitted the keyword AS where you should not and kept it where you could omit it.
  See: Date column arithmetic in PostgreSQL query
- I also formatted some more to make it easier for me to grok the query. That last part is totally optional. But use some consistent formatting style.
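To illustrate the case folding behind the first point, here is a minimal sketch. The table name demo and its column are made up for this example and are not part of the schema in the question:

```sql
-- Unquoted identifiers are folded to lower case.
CREATE TABLE Demo (Id int);        -- stored as demo, with column id
SELECT id FROM demo;               -- works
SELECT ID FROM DEMO;               -- works too: folded to id / demo

-- Quoted identifiers keep their exact spelling.
CREATE TABLE "Demo" ("Id" int);    -- a second, distinct table
SELECT "Id" FROM "Demo";           -- works, but needs the quotes every time
SELECT Id FROM "Demo";             -- fails: column "id" does not exist
```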

Arriving at this:

SELECT *
FROM   PulledTexts
WHERE  Id IN (
   SELECT w.TextIdId
   FROM   WordTexts w  -- AS can be omitted for table alias
   LEFT   JOIN UniqueWords u ON w.WordIdId = u.Id  -- LEFT JOIN might be necessary here
   WHERE  u.WordText = 'automate'
   OR     w.TextIdId IN (
      SELECT w.TextIdId  -- AS and1 -- column alias only documentation here, not visible
      FROM   WordTexts w
      JOIN   UniqueWords u ON w.WordIdId = u.Id  -- LEFT JOIN misleading here
      WHERE  u.WordText = 'audit'

      INTERSECT
      SELECT w.TextIdId  -- AS and2  -- but don't omit AS for column alias
      FROM   WordTexts w
      JOIN   UniqueWords u ON w.WordIdId = u.Id
      WHERE  u.WordText = 'trial'
      )
   );

Which can be simplified to:

SELECT *
FROM  (
   SELECT w.TextIdId AS Id
   FROM   WordTexts   w
   JOIN   UniqueWords u ON w.WordIdId = u.Id  -- now we don't need LEFT any more
   WHERE  u.WordText = 'automate'

   UNION
   SELECT w.TextIdId
   FROM   WordTexts w
   JOIN   UniqueWords u ON w.WordIdId = u.Id
   WHERE  u.WordText = 'audit'

   INTERSECT
   SELECT w.TextIdId
   FROM   WordTexts w
   JOIN   UniqueWords u ON w.WordIdId = u.Id
   WHERE  u.WordText = 'trial'
   ) w
JOIN   PulledTexts p USING (Id)

We do not need additional parentheses because, according to the manual:

INTERSECT binds more tightly than UNION. That is, A UNION B INTERSECT C will be read as A UNION (B INTERSECT C).
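A minimal way to check that precedence, using constants instead of the tables from the question:

```sql
-- Read as: SELECT 1 UNION (SELECT 2 INTERSECT SELECT 3)
SELECT 1 UNION SELECT 2 INTERSECT SELECT 3;
-- returns one row: 1

-- Explicit parentheses force the other reading.
(SELECT 1 UNION SELECT 2) INTERSECT SELECT 3;
-- returns no rows
```

Other database systems may evaluate a mixed UNION / INTERSECT chain differently, so explicit parentheses remain the safer choice for SQL that has to be portable.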

But replacing the intersected subqueries with a self-join should be faster:

SELECT *
FROM  (
   SELECT w.TextIdId AS Id
   FROM   WordTexts   w
   JOIN   UniqueWords u ON w.WordIdId = u.Id
   WHERE  u.WordText = 'automate'

   UNION
   SELECT TextIdId
   FROM   WordTexts w1
   JOIN   WordTexts w2 USING (TextIdId)
   WHERE  w1.WordIdId = (SELECT Id FROM UniqueWords WHERE WordText = 'audit')
   AND    w2.WordIdId = (SELECT Id FROM UniqueWords WHERE WordText = 'trial')
   ) w
JOIN   PulledTexts p USING (Id)

The INTERSECT part can be cast as a relational division problem. Explanation in this related answer from just yesterday:

Counting with multiple joins

db<>fiddle here
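As a sketch of another common relational-division formulation (not taken from the linked answer), the 'audit' AND 'trial' part can also be written with GROUP BY and HAVING. This assumes the unquoted, lower-cased schema used above:

```sql
SELECT w.TextIdId
FROM   WordTexts   w
JOIN   UniqueWords u ON u.Id = w.WordIdId
WHERE  u.WordText IN ('audit', 'trial')
GROUP  BY w.TextIdId
HAVING count(DISTINCT u.WordText) = 2;  -- both words must be present
```

With the sample data this returns only TextIdId 2, the one text mapped to both words. This form is easy to extend to longer word lists, but whether it beats the self-join depends on the data distribution, so compare both with EXPLAIN ANALYZE.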

Most important for performance is to have the right indexes. You probably should have a UNIQUE constraint on (WordIdId, TextIdId) in table WordTexts, which implements the currently missing index on those two columns in this order.

Does creating a unique constraint on a Postgres column remove the need to index it?
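A minimal sketch of that constraint, again using the unquoted schema. The constraint names are placeholders of my own, and the second statement is an additional assumption: the scalar subqueries on WordText above raise an error if a word is stored more than once, so enforcing uniqueness there seems prudent.

```sql
-- Multicolumn unique constraint; Postgres creates a matching unique index.
ALTER TABLE WordTexts
   ADD CONSTRAINT wordtexts_wordid_textid_uni UNIQUE (WordIdId, TextIdId);

-- Assumption: each word is stored only once. This also indexes WordText for the lookups.
ALTER TABLE UniqueWords
   ADD CONSTRAINT uniquewords_wordtext_uni UNIQUE (WordText);
```

Once the first constraint exists, the single-column index on WordIdId is probably redundant, since the new index leads with the same column.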

Related Solutions

Mysql – Query performance with subquery and IN clause

UPDATE 2012-01-12 14:03 EDT

I refactored it again to make sure the readings keys and boards keys are combined correctly before retrieving the data from the readings table:

SELECT 
    readings.* 
FROM 
    ( 
        SELECT A.* FROM
        (
            SELECT boxsn FROM readings 
            WHERE (time >= 1325404800)  
            AND (time < 1326317400)  
            ORDER BY `time` ASC
        ) A
        LEFT JOIN
        (
            SELECT id AS boxsn
            FROM boards
            WHERE siteId = '1'
        ) B
        USING (boxsn)
        WHERE B.boxsn IS NOT NULL
    ) readings_keys 
    LEFT JOIN readings 
    USING (boxsn) 
;

Postgresql – Postgres multiple joins slow query, how to store default child record

You write:

Each customer can have multiple sites, but only one should be displayed in this list.

Yet, your query retrieves all rows. That would be a point to optimize. But you also do not define which site is to be picked.

Either way, it does not matter much here. Your EXPLAIN shows only 5026 rows for the site scan (5018 for the customer scan). So hardly any customer actually has more than one site. Did you ANALYZE your tables before running EXPLAIN?

From the numbers I see in your EXPLAIN, indexes will give you nothing for this query. Sequential table scans will be the fastest possible way. Half a second is rather slow for 5000 rows, though. Maybe your database needs some general performance tuning?

Maybe the query itself is faster, but "half a second" includes network transfer? EXPLAIN ANALYZE would tell us more.

If this query is your bottleneck, I would suggest you implement a materialized view.

After you provided more information I find that my diagnosis pretty much holds.

The query itself needs 27 ms. Not much of a problem there. "Half a second" was the kind of misunderstanding I had suspected. The slow part is the network transfer (plus ssh encoding / decoding, possibly rendering). You should only retrieve 100 rows, which would solve most of it, even if it means executing the whole query every time.

If you go the route with a materialized view like I proposed, you could add a serial number without gaps to the table plus an index on it, by adding a column row_number() OVER (<your sort criteria here>) AS mv_id.
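A minimal sketch of that setup. The customer and site tables and their columns are hypothetical, since the original schema is not shown here, and CREATE MATERIALIZED VIEW requires PostgreSQL 9.3 or later (on older versions a plain table filled with CREATE TABLE ... AS would play the same role):

```sql
CREATE MATERIALIZED VIEW materialized_view AS
SELECT row_number() OVER (ORDER BY c.name, c.customer_id, s.site_id) AS mv_id
     , c.customer_id, c.name, s.site_id
FROM   customer c                                      -- hypothetical table
LEFT   JOIN site s ON s.customer_id = c.customer_id;   -- hypothetical table

-- Index the gapless number so range lookups stay cheap.
CREATE UNIQUE INDEX materialized_view_mv_id_idx ON materialized_view (mv_id);

-- Re-run whenever the underlying data has changed.
REFRESH MATERIALIZED VIEW materialized_view;
```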

Then you can query:

SELECT *
FROM   materialized_view
WHERE  mv_id >= 2700
AND    mv_id <  2800;

This will perform very fast. LIMIT / OFFSET cannot compete, because it has to compute the whole result before it can sort and pick 100 rows.

pgAdmin timing

When you execute a query from the query tool, the message pane shows something like:

Total query runtime: 62 ms.

And the status line shows the same time. I quote pgAdmin help about that:

The status line will show how long the last query took to complete. If a dataset was returned, not only the elapsed time for server execution is displayed, but also the time to retrieve the data from the server to the Data Output page.

If you want to see the time on the server, you need to use SQL EXPLAIN ANALYZE, the built-in Shift + F7 keyboard shortcut, or Query -> Explain analyze. Then, at the bottom of the explain output you get something like this:

Total runtime: 0.269 ms
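For example, to time the paged query from above on the server (a sketch; the output format depends on the server version, and newer PostgreSQL releases report planning and execution time separately instead of a single total runtime):

```sql
EXPLAIN ANALYZE
SELECT *
FROM   materialized_view
WHERE  mv_id >= 2700
AND    mv_id <  2800;
```

Keep in mind that EXPLAIN ANALYZE actually executes the statement, so wrap data-modifying statements in a transaction and roll it back afterwards.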
