I have a database of text tuples. Imagine e.g. full file paths + standard comments. Then you dump large file tree and generate comments for files. A file can have lots of comments, but their text repeats exactly. The DB gets into several GBs size, so I think it's quite big for sqlite standard, I'm a noob here though.
Anyways, since strings in individual columns do repeat quite a lot, but combinations are original, I thought I can have tables for distinct original strings, and combinations just as tuples of foreign keys (I believe that's called normalization), then a VIEW
, preserving all API, for reading.
The question is, can I implement deduplication mechanism on the DB side for insert?
I thought of something like INSERT
on (text_column1, text_column2, text_column3)
and then write some kind of INSTEAD OF INSERT
trigger, that would split it into 3 INSERT IF NOT EXISTS
commands + 1 insert into relation table. But I don't think it is even possible to have different "interface" and "storage" schema. I certainly failed to write it.
I have supplementary questions which tightly related (thus not worthy of separate entries, I believe):
- Maybe sqlite3 already does string deduplication behind the scenes? (I doubt it, since mutable strings would complicate things a bit. Nothing undoable though.)
- If it's hard then maybe it's a bad idea for some reason?
If it helps my data is read-only once inserted.
I've read:
But they regard different SQL systems, and don't really answer general question. I gather that this is not a popular problem and solution.
Best Answer
Here is a rough sketch. I could only test with SQLite 3.8 and it appears as if
UPSERT
is introduced in 2018-06-04 - Release 3.24.0. I.e. the following is untested, but hopefully you can make something out of it anyhowrowid appears to be recommended over autoincrement, but that is sort of beside the point so I used it anyhow.
For 3.8 I used this ugly workaround:
You can try it at DB<>Fiddle. Test: