Postgresql – Optimizing query with type casting

database-designdatatypesperformancepostgresqlpostgresql-9.5postgresql-performance

I had a query that was taking way too much time despite having all the indexes I could think of.

Eventually, I realized that one of the JOIN .. ON inside the query was casting a column's content to another data type because the column from table A is of type varchar while the matching column of table B is of type integer.

I changed my code to use a temporary table in which I insert the rows I need from table A and cast the column from varchar to integer.

It improved the query speed by about a 1000x!

Table zone_site:

CREATE TABLE traitements.zones_sites (
    pkid serial PRIMARY KEY,
    pkid_site integer NOT NULL,
    origine varchar(50) NOT NULL, 
    origine_id varchar(255) NOT NULL,  -- can't be converted to int
    catégorie varchar(255) NOT NULL,
    horodatage timestamp NOT NULL,
    geom geometry(MultiPolygon,2154),
    précision_contour varchar(100),
    statut varchar(100),
    détails_jsonb text,
    CONSTRAINT enforce_dims_geomloc CHECK (ST_NDims(geom) = 2)
);

CREATE INDEX zones_sites_idx_pkid_site  ON traitements.zones_sites (pkid_site);
CREATE INDEX zones_sites_géométrie ON traitements.zones_sites USING GIST (geom);
CREATE INDEX zones_sites_précision_contour ON traitements.zones_sites (précision_contour);
CREATE INDEX zones_sites_idx_catégorie ON traitements.zones_sites (catégorie);
CREATE INDEX zones_sites_idx_origine ON traitements.zones_sites (origine);
CREATE INDEX zones_sites_idx_statut ON traitements.zones_sites (statut);
ALTER TABLE  traitements.zones_sites
    ADD CONSTRAINT zones_sites_references_sites_candidats FOREIGN KEY (pkid_site)
    REFERENCES traitements.sites_candidats(pkid) ON DELETE CASCADE;

Related question on dba.SE:

Unfortunately, the column traitements.zones_sites.origine_id cannot be of type integer because it holds identifier from several origins, some of them not being integers.

Original query:

SELECT emprises.pkid, emprises.pkid_site, emprises.origine, parcelles.idpar
FROM traitements.zones_sites AS emprises
JOIN parcelles ON parcelles.idpk::varchar = emprises.origine_id
WHERE emprises.catégorie = 'emprise_site'
AND emprises.précision_contour = 'contour_inconnu'
AND emprises.origine = 'xxxxxxxxx';

Modified code:

CREATE TEMPORARY TABLE temp_emprises (
    pkid serial PRIMARY KEY,
    pkid_site integer NOT NULL,
    origine varchar(50) NOT NULL, 
    idpk_parcelle integer NOT NULL
);

INSERT INTO temp_emprises (pkid, pkid_site, origine, idpk_parcelle)
SELECT pkid, pkid_site, origine, origine_id::integer 
FROM traitements.zones_sites AS emprises
WHERE emprises.catégorie = 'emprise_site'
AND emprises.précision_contour = 'contour_inconnu'
AND emprises.origine = 'xxxxxxxxx';

CREATE INDEX ON temp_emprises(idpk_parcelle);

SELECT emprises.pkid, emprises.pkid_site, emprises.origine, parcelles.idpar
FROM temp_emprises AS emprises
JOIN parcelles ON parcelles.idpk = emprises.idpk_parcelle;

Is there a better solution?

Best Answer

Proper solution

Your comment (which I added to your question) discloses the root of the problem:

Unfortunately, the column traitements.zones_sites.origine_id cannot be of type integer because it holds identifier from several origins, some of them not being integers.

A roughly normalized DB design would avoid to mix data of different kind in the same column. Then your true integer values could be in a column of type integer and the rest in (a) text column(s). And there wouldn't be a problem to begin with.

While stuck with your current design

Since, as you asserted, the column origine_id can be cast to integer for the selection of rows in your query, I suggest this much faster alternative: create a partial expression index:

CREATE INDEX zones_sites_idx_origine_part_int ON traitements.zones_sites ((origine_id::int))
WHERE catégorie = 'emprise_site'
AND   précision_contour = 'contour_inconnu'
AND   origine = 'xxxxxxxxx';

If the cast in your INSERT into the temp table is guaranteed to work, then so is this index.

Slightly modified query:

SELECT z.pkid, z.pkid_site, z.origine, p.idpar
FROM   traitements.zones_sites z
JOIN   parcelles               p ON p.idpk = z.origine_id::int
WHERE  z.catégorie = 'emprise_site'
AND    z.précision_contour = 'contour_inconnu'
AND    z.origine = 'xxxxxxxxx';

Since the partial index also supports your query perfectly, this should make your current solutions look like slow motion in comparison. And it's simpler and more reliable, too.

The only other index you need is one on parcelles(idpk).
Since the only other column you need from parcelles is idpar, if that table is big and write-load is low so you can get index-only scans out of it, an (additional) multicolumn index on parcelles(idpk, idpar) would make it even faster.

And faster, yet, in the upcoming Postgres 11 with true covering indexes with an INCLUDE column:

Related answer with code example: