SQLite String Column Deduplication and Normalization on Insert

database-design normalization sqlite

I have a database of text tuples. Imagine, e.g., full file paths plus standard comments: you dump a large file tree and generate comments for the files. A file can have lots of comments, but their text repeats exactly. The DB grows to several GB, which I think is quite big by SQLite standards, though I'm a noob here.

Anyway, since strings in individual columns repeat a lot but their combinations are unique, I thought I could have tables for the distinct strings and store the combinations as tuples of foreign keys (I believe that's called normalization), plus a VIEW for reading that preserves the existing API.

The question is: can I implement a deduplication mechanism on the DB side for inserts?

I thought of issuing an INSERT on (text_column1, text_column2, text_column3) and writing some kind of INSTEAD OF INSERT trigger that would split it into three "insert if not exists" commands plus one insert into the relation table. But I'm not sure it is even possible to have different "interface" and "storage" schemas. I certainly failed to write it.

I have supplementary questions which are tightly related (and thus not worthy of separate entries, I believe):

Maybe sqlite3 already does string deduplication behind the scenes? (I doubt it, since mutable strings would complicate things a bit. Nothing undoable, though.)
If it's hard to do, then maybe it's a bad idea for some reason?

If it helps, my data is read-only once inserted.

I've read:

But they concern different SQL systems and don't really answer the general question. I gather that this is neither a popular problem nor a popular solution.

Best Answer

Here is a rough sketch. I could only test with SQLite 3.8, and UPSERT appears to have been introduced in release 3.24.0 (2018-06-04), so the following is untested, but hopefully you can make something of it anyhow:

create table Texts
( tid integer not null primary key AUTOINCREMENT
, textval varchar(20) not null unique);

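-- note: the UNIQUE constraint on textval already creates an index,
-- so this explicit index is redundant
create index x1 on texts (textval);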

create table T
( x int not null 
, tid int not null references texts (tid)
, primary key (x, tid) );

create view v as 
   select t.x, texts.textval
   from t
   join texts
       on t.tid = texts.tid
;

CREATE TRIGGER trig1 
INSTEAD OF INSERT ON V 
BEGIN
        -- insert unless textval is already in place
        INSERT INTO texts (textval)
        VALUES (NEW.textval) ON CONFLICT (textval) DO NOTHING;

        -- lookup tid for textval
        insert into t (x, tid)
        select NEW.x, texts.tid
        from texts where texts.textval = NEW.textval;

END;

A plain rowid (INTEGER PRIMARY KEY without AUTOINCREMENT) appears to be recommended over AUTOINCREMENT, but that is somewhat beside the point, so I used AUTOINCREMENT anyhow.

For 3.8 I used this ugly workaround:

CREATE TRIGGER trig1 
INSTEAD OF INSERT ON V 
BEGIN
        -- insert unless textval is already in place
        INSERT INTO texts (textval)
        SELECT NEW.textval FROM (VALUES(1))
        WHERE NOT EXISTS (
            SELECT 1 FROM texts WHERE textval = NEW.textval 
        );

        --lookup tid for textval
        insert into t (x, tid)
        select NEW.x, texts.tid
        from texts where texts.textval = NEW.textval;
END;
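As a possible alternative to both trigger variants above, SQLite's long-standing INSERT OR IGNORE conflict clause can stand in for the "insert unless already present" step. The following is a minimal sketch driven from Python's standard sqlite3 module, not part of the original answer; it reuses the answer's table, view, and trigger names, and it assumes the outer INSERT on the view carries no conflict clause of its own (an outer clause would override the one inside the trigger).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE texts (
    tid     INTEGER PRIMARY KEY,
    textval TEXT NOT NULL UNIQUE
);
CREATE TABLE t (
    x   INTEGER NOT NULL,
    tid INTEGER NOT NULL REFERENCES texts (tid),
    PRIMARY KEY (x, tid)
);
CREATE VIEW v AS
    SELECT t.x, texts.textval
    FROM t JOIN texts ON t.tid = texts.tid;

CREATE TRIGGER trig1
INSTEAD OF INSERT ON v
BEGIN
    -- OR IGNORE skips the row when the UNIQUE constraint on textval would fail
    INSERT OR IGNORE INTO texts (textval) VALUES (NEW.textval);

    -- look up the tid for textval and store the (x, tid) pair
    INSERT INTO t (x, tid)
    SELECT NEW.x, texts.tid FROM texts WHERE texts.textval = NEW.textval;
END;
""")

# Same test data as in the answer: 'a' is inserted twice but stored once.
for x, textval in [(1, "a"), (5, "bb"), (15, "a")]:
    conn.execute("INSERT INTO v (x, textval) VALUES (?, ?)", (x, textval))

texts_rows = conn.execute("SELECT tid, textval FROM texts ORDER BY tid").fetchall()
view_rows = conn.execute("SELECT x, textval FROM v ORDER BY x").fetchall()
print(texts_rows)
print(view_rows)
```
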

You can try it at DB<>Fiddle. Test:

insert into V (x,textval) values (1,'a'); 
insert into V (x,textval) values (5,'bb');  
insert into V (x,textval) values (15,'a');

select * from v;
x   textval
1   a
5   bb
15  a

select * from texts;
tid textval
1   a
2   bb

select * from t;
x   tid
1   1
5   2
15  1
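The same pattern might be extended to the three text columns described in the question. The sketch below is only an illustration under assumptions: the names strings, entry_ids, entries, and entries_insert are hypothetical, all three columns share one lookup table (separate tables per column would work the same way), and INSERT OR IGNORE is used for the "insert if not exists" step so that it does not depend on UPSERT support.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- storage schema: one row per distinct string (hypothetical names)
CREATE TABLE strings (
    sid INTEGER PRIMARY KEY,
    val TEXT NOT NULL UNIQUE
);
-- storage schema: each combination is a tuple of foreign keys
CREATE TABLE entry_ids (
    id1 INTEGER NOT NULL REFERENCES strings (sid),
    id2 INTEGER NOT NULL REFERENCES strings (sid),
    id3 INTEGER NOT NULL REFERENCES strings (sid)
);
-- interface schema: the view exposes the original text columns
CREATE VIEW entries AS
    SELECT s1.val AS text_column1, s2.val AS text_column2, s3.val AS text_column3
    FROM entry_ids e
    JOIN strings s1 ON s1.sid = e.id1
    JOIN strings s2 ON s2.sid = e.id2
    JOIN strings s3 ON s3.sid = e.id3;

CREATE TRIGGER entries_insert
INSTEAD OF INSERT ON entries
BEGIN
    -- store each string only if it is not already present
    INSERT OR IGNORE INTO strings (val) VALUES (NEW.text_column1);
    INSERT OR IGNORE INTO strings (val) VALUES (NEW.text_column2);
    INSERT OR IGNORE INTO strings (val) VALUES (NEW.text_column3);

    -- store the combination as a tuple of ids
    INSERT INTO entry_ids (id1, id2, id3)
    VALUES (
        (SELECT sid FROM strings WHERE val = NEW.text_column1),
        (SELECT sid FROM strings WHERE val = NEW.text_column2),
        (SELECT sid FROM strings WHERE val = NEW.text_column3)
    );
END;
""")

# Made-up sample data: paths are unique, comments repeat.
sample = [
    ("/usr/bin/ls", "executable", "reviewed"),
    ("/usr/bin/cat", "executable", "reviewed"),
    ("/etc/hosts", "config", "reviewed"),
]
conn.executemany(
    "INSERT INTO entries (text_column1, text_column2, text_column3) VALUES (?, ?, ?)",
    sample,
)

distinct_strings = conn.execute("SELECT COUNT(*) FROM strings").fetchone()[0]
entries_rows = conn.execute(
    "SELECT text_column1, text_column2, text_column3 FROM entries ORDER BY text_column1"
).fetchall()
print(distinct_strings)
print(entries_rows)
```

Callers read from and insert into the entries view only, so the split between "interface" and "storage" schema stays invisible to them.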

Related Solutions

Database Design – Preventing Duplicate Data with Selective Normalization

If I understand your dilemma correctly, you have:

Two tables, each of which can have a comment
One of the tables is optional, i.e. it may not have an entry to correspond to the other
For the second table, the comment is also optional, such that even if the second table has a record to match the first, the comment in the second may just "default" to a copy of the first.
On the other hand, the second table may have a distinct comment after all.

If this is the case, your concern is that recording the "default" (duplicate) comment in the second table is wasteful or even dangerous since the data could get out of whack.

In this situation, you can use the SQL COALESCE function along with a LEFT OUTER JOIN to solve your problem.

Using the outer join lets you use a single SQL statement that pulls together the two tables (when there are records in each) or just pull data from the first (mandatory) table if the second (optional) record is missing. No complicated branching, just a single SQL select. The coalesce allows you to pick the first non-NULL value in a list of values. This is useful because it lets you take the first comment as a default if the second comment is NULL. The second comment can be NULL because either (a) there is no matching record in the second table or (b) the comment within the second table is NULL.

It seems like in your case there will always be an ISSUE entry. Sometimes this entry is created by a user and sometimes it is the result of an unsatisfactory STATUS_CHECK.

In this case you want to select:

...
COALESCE(I.comment, C.comment) as Remark
FROM ISSUE I LEFT OUTER JOIN STATUS_CHECK C
ON I.ID = C.ISSUE_ID
...

The net effect of this is that if you have a satisfactory status check, it will show the status check comment. If you have an unsatisfactory status check or an independently raised issue, it will show the (non-NULL) comment from the issue table.
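To see how the COALESCE plus LEFT OUTER JOIN query behaves on concrete rows, here is a small sketch using Python's standard sqlite3 module. The column layout of ISSUE and STATUS_CHECK and the sample rows are assumptions made up for illustration; only the table names, the join condition, and the COALESCE expression come from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- hypothetical minimal schemas for the two tables named in the answer
CREATE TABLE ISSUE (
    ID      INTEGER PRIMARY KEY,
    comment TEXT
);
CREATE TABLE STATUS_CHECK (
    ISSUE_ID INTEGER REFERENCES ISSUE (ID),
    comment  TEXT
);

-- issue 1: has its own comment, no status check row at all
INSERT INTO ISSUE VALUES (1, 'raised by user');
-- issue 2: no comment of its own, falls back to the status check comment
INSERT INTO ISSUE VALUES (2, NULL);
INSERT INTO STATUS_CHECK VALUES (2, 'disk nearly full');
-- issue 3: both comments present, the issue comment is listed first and wins
INSERT INTO ISSUE VALUES (3, 'own remark');
INSERT INTO STATUS_CHECK VALUES (3, 'check remark');
""")

remarks = conn.execute("""
    SELECT I.ID, COALESCE(I.comment, C.comment) AS Remark
    FROM ISSUE I LEFT OUTER JOIN STATUS_CHECK C
      ON I.ID = C.ISSUE_ID
    ORDER BY I.ID
""").fetchall()
print(remarks)
```

COALESCE returns its first non-NULL argument, so the order of the arguments decides which comment takes precedence when both are present.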

MySQL – Automatic Deduplication (Normalization) of Strings

MySQL has no such feature.

It is left to the user to "normalize" the data either for avoiding having to update multiple spots and/or for saving space.

In your example, it is generally not practical to do this with first/last names. But it may be advisable for "locations".

It is not practical to dedup names (for example) because the payoff is poor. With "Mary" sitting somewhere else, the code has to reach for the string either implicitly (as you hypothesize) or explicitly (via a JOIN). In large datasets, this is likely to cause an extra disk hit, which is costly. Also, "Mary" is 4 characters; the first_name_id might be a MEDIUMINT UNSIGNED, which is 3 bytes, so the savings are small. For bigger strings that repeat a lot (company names), the tradeoffs might be better.

The main purpose of "normalization" is to put things in a single location. A company name should be spelled out only once in a system that talks about companies. In its place an id would be used -- possibly an integer, possibly a ticker symbol (as in a Stock database). When the company changes its name, only one spot needs to be changed. (If the ticker changes, as happened with AOL, then the code is messy.)
