Optimizing regex based query in sqlite

optimizationquery-performanceregexsqlite

I'm using an sqlite database to store manually created labels for some data automatically queried from a live system. The data from the live system consists primarily of an address, comprised of 3 parts. Let's use URLs an example, the three parts being the protocol, the domain and the path. Initially, I would load a couple 100k worth of addresses into a table with each field of the address being a column and together building the primary key. The labels are then in additional columns.

CREATE TABLE OldWebsites (
    protocol VARCHAR (255) NOT NULL,
    domain   VARCHAR (255) NOT NULL,
    path     VARCHAR (255) NOT NULL,
    label1   INTEGER,
    label2   TEXT,
    CONSTRAINT address PRIMARY KEY (
        protocol,
        domain,
        path
    )
);

I found myself repeating labels over and over based on certain patterns that the address would match. Since I would always extend this table with new data and remove old data, this became too much of a hazzle, so I tried a different approach, namely just loading the existing addresses into one table and then have other tables for the data where I would write regex matchers for the address components

CREATE TABLE Websites (
    protocol VARCHAR (255) NOT NULL,
    domain   VARCHAR (255) NOT NULL,
    path     VARCHAR (255) NOT NULL,
    CONSTRAINT address PRIMARY KEY (
        protocol,
        domain,
        path
    )
);

CREATE TABLE Label1 (
    protocol_re VARCHAR (255) NOT NULL,
    domain_re   VARCHAR (255) NOT NULL,
    path_re     VARCHAR (255) NOT NULL,
    label1   INTEGER
    CONSTRAINT address_matcher PRIMARY KEY (
        protocol_re,
        domain_re,
        path_re
    )
);

CREATE TABLE Label2 (
    protocol_re VARCHAR (255) NOT NULL,
    domain_re   VARCHAR (255) NOT NULL,
    path_re     VARCHAR (255) NOT NULL,
    label2   TEXT,
    CONSTRAINT address_matcher PRIMARY KEY (
        protocol_re,
        domain_re,
        path_re
    )
);

Assume that I have already (using other queries) guaranteed, that there is exactly one match in each label table for each address in the Websites table. I would now like to write a query that reconstructs a table like the original OldWebsites one by matching labels and automatically queried data.

Something like this

SELECT Websites.*,
       Label1.label1,
       Label2.label2
  FROM Websites
       JOIN
       Label1 ON (Websites.protocol REGEXP '^' || Label1.protocol_re || '$' AND 
                  Websites.domain REGEXP '^' || Label1.domain_re || '$' AND 
                  Websites.path REGEXP '^' || Label1.path_re || '$') 
       JOIN
       Label2 ON (Websites.protocol REGEXP '^' || Label2.protocol_re || '$' AND 
                  Websites.domain REGEXP '^' || Label2.domain_re || '$' AND 
                  Websites.path REGEXP '^' || Label2.path_re || '$');

Now.. this is really slow, especially for more label tables, using PCRE sqlite3 extension for the REGEXP function.

I would like to know if there's way to optimize this query using either parallelization (the query should run ideally from python) or using the knowledge that there is exactly 1 match in each Label table.

From my understanding, multiple inner joins should take at most the sum of the individual joins, correct?

Perhaps indexes are also helpful, but I have only a basic idea of what they are and no idea whether they would be of help here.

Best Answer

I noticed that my assumption that multiple inner joins should just take the sum in time of the individual joins is apparently not true. I got the query up to a decent speed by preprocessing the parts to be joined, so a more efficient join strategy can be used:

SELECT Websites.*,
       L1.label1,
       L2.label2
  FROM Websites
       JOIN
       (
           SELECT Websites.*,
                  Label1.label1
             FROM Websites
                  JOIN
                  Label1 ON (Websites.protocol REGEXP '^' || Label1.protocol_re || '$' AND 
                             Websites.domain REGEXP '^' || Label1.domain_re || '$' AND 
                             Websites.path REGEXP '^' || Label1.path_re || '$') 
       )
       AS L1 USING (
           protocol,
           domain,
           path
       )
       JOIN
       (
           SELECT Websites.*,
                  Label2.label2
             FROM Websites
                  JOIN
                  Label1 ON (Websites.protocol REGEXP '^' || Label2.protocol_re || '$' AND 
                             Websites.domain REGEXP '^' || Label2.domain_re || '$' AND 
                             Websites.path REGEXP '^' || Label2.path_re || '$') 
       )
       AS L2 USING (
           protocol,
           domain,
           path
       );

It may not look like much, but it works very well.