I have the following schema for my database (Postgresql 10):
CREATE TABLE "PulledTexts" (
"Id" serial PRIMARY KEY,
"BaseText" TEXT,
"CleanText" TEXT
);
CREATE TABLE "UniqueWords" (
"Id" serial PRIMARY KEY,
"WordText" TEXT
);
CREATE TABLE "WordTexts" (
"Id" serial PRIMARY KEY,
"TextIdId" INTEGER REFERENCES "PulledTexts",
"WordIdId" INTEGER REFERENCES "UniqueWords"
);
CREATE INDEX "IX_WordTexts_TextIdId" ON "WordTexts" ("TextIdId");
CREATE INDEX "IX_WordTexts_WordIdId" ON "WordTexts" ("WordIdId");
Some sample data:
INSERT INTO public."PulledTexts" ("Id", "BaseText", "CleanText") VALUES
(1, 'automate business audit', null),
(2, 'audit trial', null),
(3, 'trial', null),
(4, 'audit', null),
(5, 'fresh report', null),
(6, 'fresh audit', null),
(7, 'automate this script', null),
(8, 'im trying here', null),
(9, 'automate this business', null),
(10, 'lateral', null);
INSERT INTO public."UniqueWords" ("Id", "WordText") VALUES
(1, 'trial'),
(2, 'audit'),
(3, 'creation'),
(4, 'business'),
(5, 'automate');
INSERT INTO public."WordTexts" ("Id", "TextIdId", "WordIdId") VALUES
(1, 1, 2),
(2, 1, 4),
(3, 1, 5),
(4, 2, 1),
(5, 3, 1),
(6, 4, 2),
(7, 6, 2),
(8, 7, 5),
(9, 9, 4),
(10, 9, 5),
(11, 2, 2);
The database itself is created through Entity framework migration for now.
I would like to know if there's a better, specifically more performance efficient way of writing this query because the WordTexts
table will contain hundreds of thousands of records and eventually millions. I am also open to going a NoSql route if that would be more efficient for these kinds of queries.
SELECT *
FROM "PulledTexts"
WHERE "Id" IN (
SELECT "TextIdId"
FROM "WordTexts" AS "wordTexts"
LEFT JOIN "UniqueWords" AS "wordTexts.WordId" ON "wordTexts"."WordIdId" = "wordTexts.WordId"."Id"
WHERE "wordTexts.WordId"."WordText" = 'automate'
OR "TextIdId" IN (
SELECT "TextIdId" and1
from "WordTexts" AS "wordTexts"
LEFT JOIN "UniqueWords" AS "wordTexts.WordId" ON "wordTexts"."WordIdId" = "wordTexts.WordId"."Id"
where "wordTexts.WordId"."WordText" = 'audit' INTERSECT
SELECT "TextIdId" and2
from "WordTexts" AS "wordTexts"
LEFT JOIN "UniqueWords" AS "wordTexts.WordId" ON "wordTexts"."WordIdId" = "wordTexts.WordId"."Id"
WHERE "wordTexts.WordId"."WordText" = 'trial'
)
);
At a high level, this query is supposed to return PulledTexts whose IDs match those returned by the subqueries. The subqueries are simply meant to return the list of IDs of PulledTexts that contain lets say ('audit' AND 'trial') OR 'automate'
from the UniqueWords table. This is exactly what the example query I gave does. The WordTexts table is a simple mapping of UniqueWord to PulledText.
Best Answer
While your query is valid, there is a lot I would differently.
Don't use CaMeL-case names in Postgres if it can be avoided. Your unnamed entity framework may force this nonsense upon you, but I don't want to deal with the double-quote mess, so I tested with your schema after removing all double-quotes - effectively making all identifiers lower-cased.
Don't use illegible or illegal column and table aliases (like
"wordTexts.WordId"
). That's a matter of taste and style (and sanity), but you also omitted the keywordAS
where you should not and kept it where you could omit it.I also formatted some more to make it easier for me to grok the query. That last part is totally optional. But use some consistent formatting style.
Arriving at this:
Which can be simplified to:
We do not need additional parentheses because, according to the manual:
But this will be faster when replacing multiple intersected subqueries:
The
INTERSECT
part can be cast as relational division problem. Explanation in this related answer from just yesterday:db<>fiddle here
Most important for performance is to have the right indexes. You probably should have a
UNIQUE
constraint on(WordIdId, TextIdId)
in tableWordTexts
, which implements the currently missing index on those two columns in this order.