Postgresql – Performance of selecting strings: substring IN vs regular expression

postgresqlquery-performance

I have a large table with 1 billion+ records. These are the important columns:

CREATE TABLE BigTable
(
   id BIGSERIAL PRIMARY KEY NOT NULL,
   hexid VARCHAR(255), -- usually 24 or 48 characters
   type VARCHAR(255),
   <and other columns>
);

I need to find some records matching on a substring of the hexid column as well as the type column. There is an index on the hexid column. I need to search for a list of strings matching the last 24 characters of the hexid (the records I am interested in will always have 48 character hexids). There are two queries that I could run and I'm wondering which will perform better:

Query 1 using WHERE IN:

SELECT * FROM BigTable
WHERE substring(hexid, 25) IN (
    'deadbeef4b00018c09bf5db1',
    '5f478c50deadbeef97039344',
    '5fc0b855a6a8b600deadbeef')
AND type IN ('OneType', 'SomeType', 'AnotherType')
ORDER BY id ASC;

with explain plan:

Sort  (cost=105476938.69..105477029.92 rows=36492 width=638)
  Sort Key: id
  ->  Seq Scan on BigTable  (cost=0.00..105468257.46 rows=36492 width=638)
        Filter: (((type)::text = ANY ('{OneType,SomeType,AnotherType}'::text[])) AND (""substring""((hexid)::text, 25) = ANY ('{deadbeef4b00018c09bf5db1,5f478c50deadbeef97039344,5fc0b855a6a8b600deadbeef}'::text[])))"

Query 2 using regular expressions:

SELECT * FROM BigTable
WHERE hexid ~ '\w{24}(deadbeef4b00018c09bf5db1|5f478c50deadbeef97039344|5fc0b855a6a8b600deadbeef)'
AND type ~ '(One|Some|Another)Type'
ORDER BY id ASC;

with explain plan:

Sort  (cost=100022576.55..100022577.16 rows=243 width=638)
  Sort Key: id
  ->  Seq Scan on BigTable  (cost=0.00..100022566.92 rows=243 width=638)
        Filter: (((hexid)::text ~ '\w{24}(deadbeef4b00018c09bf5db1|5f478c50deadbeef97039344|5fc0b855a6a8b600deadbeef)'::text) AND ((type)::text ~ '(One|Some|Another)Type'::text))

Based on the explain plans the cost will be more or less the same.

  • Query 1 performs a substring, doesn't have pattern matching [fast], but does have WHERE IN clauses [can negatively affect performance].
  • Query 2 has pattern matching (that takes care of the substring) [slow], but doesn't have WHERE IN clauses [doesn't negatively affect performance].

Getting the data is a once-off process. I don't have the luxury of time running these queries individually to determine which one performs better. I need to make an informed decision on which one to go with.

  1. Which query will give the best performance?
  2. Why does the first explain plan show rows=36492 and the second rows=243?
  3. Is there perhaps another query that would perform much better? Keep in mind that, since it is a once-off query, adding additional indices would not be helpful.

(I'm interested in knowing the answers for PostgreSQL, but I guess it will be similar for other RDBMSs.)

Best Answer

The version with IN will be slightly faster, because = ANY is cheaper than ~.

The estimate suggests otherwise, but only because it estimates a different number of result rows.

You can make these queries faster with an index on the more selective condition.