SQLite String Column Deduplication and Normalization on Insert

database-designnormalizationsqlite

I have a database of text tuples. Imagine e.g. full file paths + standard comments. Then you dump large file tree and generate comments for files. A file can have lots of comments, but their text repeats exactly. The DB gets into several GBs size, so I think it's quite big for sqlite standard, I'm a noob here though.

Anyways, since strings in individual columns do repeat quite a lot, but combinations are original, I thought I can have tables for distinct original strings, and combinations just as tuples of foreign keys (I believe that's called normalization), then a VIEW, preserving all API, for reading.

The question is, can I implement deduplication mechanism on the DB side for insert?

I thought of something like INSERT on (text_column1, text_column2, text_column3) and then write some kind of INSTEAD OF INSERT trigger, that would split it into 3 INSERT IF NOT EXISTS commands + 1 insert into relation table. But I don't think it is even possible to have different "interface" and "storage" schema. I certainly failed to write it.

I have supplementary questions which tightly related (thus not worthy of separate entries, I believe):

  1. Maybe sqlite3 already does string deduplication behind the scenes? (I doubt it, since mutable strings would complicate things a bit. Nothing undoable though.)
  2. If it's hard then maybe it's a bad idea for some reason?

If it helps my data is read-only once inserted.

I've read:

  1. Does PostgreSQL have a variable character storage layer that optimizes storage space by automatic dedupe?

  2. Implement a deduplication trigger in Oracle

  3. Automatic deduplication (normalization) of strings in MySQL

But they regard different SQL systems, and don't really answer general question. I gather that this is not a popular problem and solution.

Best Answer

Here is a rough sketch. I could only test with SQLite 3.8 and it appears as if UPSERT is introduced in 2018-06-04 - Release 3.24.0. I.e. the following is untested, but hopefully you can make something out of it anyhow

create table Texts
( tid integer not null primary key AUTOINCREMENT
, textval varchar(20) not null unique);

create index x1 on texts (textval);

create table T
( x int not null 
, tid int not null references texts (tid)
, primary key (x, tid) );

create view v as 
   select t.x, texts.textval
   from t
   join texts
       on t.tid = texts.tid
;

CREATE TRIGGER trig1 
INSTEAD OF INSERT ON V 
BEGIN
        -- insert unless textval is already in place
        INSERT INTO texts (textval)
        VALUES (NEW.textval) ON CONFLICT (textval) DO NOTHING;

        -- lookup tid for textval
        insert into t (x, tid)
        select NEW.x, texts.tid
        from texts where texts.textval = NEW.textval;

END;

rowid appears to be recommended over autoincrement, but that is sort of beside the point so I used it anyhow.

For 3.8 I used this ugly workaround:

CREATE TRIGGER trig1 
INSTEAD OF INSERT ON V 
BEGIN
        -- insert unless textval is already in place
        INSERT INTO texts (textval)
        SELECT NEW.textval FROM (VALUES(1))
        WHERE NOT EXISTS (
            SELECT 1 FROM texts WHERE textval = NEW.textval 
        );

        --lookup tid for textval
        insert into t (x, tid)
        select NEW.x, texts.tid
        from texts where texts.textval = NEW.textval;
END;

You can try it at DB<>Fiddle. Test:

insert into V (x,textval) values (1,'a'); 
insert into V (x,textval) values (5,'bb');  
insert into V (x,textval) values (15,'a');

select * from v;
x   textval
1   a
5   bb
15  a

select * from texts;
tid textval
1   a
2   bb

select * from t;
x   tid
1   1
5   2
15  1