How to reduce size of a large SQLite database with series data

sqlite

I'm considering to use SQLite database for a C# application which deals with large volumes of data series. The data is currently in several CSV files of up to 20GBs in size each and of the following format:

2019.07.31 00:00:03.855,1.11568,1.11571,3,0

I was expecting to reduce the size considerably when migrating to SQLite but to some reason am getting some different results.

The example string in CSV format takes 44 bytes (43 chars + new line). As far as I understood the SQLite types definitions described at sqlite.org it should take about the same in the database:
23 byte datetime text + 8*2 reals + 1*2 integers = 41 bytes

The dateTime can be stored as 8-byte integer (c# DateTime.Ticks) which should reduce the size to about 26 bytes per one line which is almost twice less than in CSV format.

For an experiment I tried importing 142,157 lines of data. The CSV file is about 6kk bytes in size but the resulting DB file is about 5kk which is almost the same. Compression (available in DB Browser for SQLite) does not change anything.

The table schema is:

CREATE TABLE "Data" ( "dateTime" INTEGER, "value1" REAL, "value2" REAL, "value3" INTEGER, "value4" INTEGER )

An example row looks like this:

637001280038550000 1.11568 1.11571 3 0

Is it the SQLite internals which make that big overhead in size?
How can I reduce the size?

As a side note, setting Primary Key on the dateTime field reduces the size of the example data file by 0.3kk bytes but adding unique constraint increases it by 3kk bytes making it about 8kk in total which is much bigger than the CSV one.

Update

output from sqlite3_analyzer

w/t PK and contraints
Size of the file in bytes......................... 5242880
Bytes of user payload stored...................... 4349676     83.0%
Bytes of payload.................................. 4349852     83.0%
Bytes of metadata................................. 858115      16.4%

with PK
Size of the file in bytes......................... 4976640
Bytes of user payload stored...................... 3212420     64.5%
Bytes of payload.................................. 3212622     64.6%
Bytes of metadata................................. 1733838     34.8%

with PK and Unique constraint
Size of the file in bytes......................... 8134656
Bytes of user payload stored...................... 3212420     39.5%
Bytes of payload.................................. 3212622     64.6%
Bytes of metadata................................. 1733838     34.8%
*** Page counts for all tables and indices separately
TABLE............................................. 1214        61.1%
SQLITE_AUTOINDEX_TABLE_1.......................... 771         38.8%

So it turns out that the internals take 16%, PK takes additional 19% but still reduces the total size, and the unique contraint takes almost 40%!

The question remains valid. Is it possible to recude the size of the database or that is all one can get from it?

Best Answer

There are some ways. Here are two that seem appropriate for your case:

WITHOUT ROWID tables

The ROWID is an internal primary key that SQLite tables have by default, taking 64 bit per record. Since you were talking about a UNIQUE constraint for the datetime column, this column seems a good candidate for the primary key (which always has a unique constraint attached). So you can create a SQLite table without ROWID, as follows:

CREATE TABLE "Data" (
    "dateTime" INTEGER PRIMARY KEY,
    "value1" REAL, 
    "value2" REAL, 
    "value3" INTEGER, 
    "value4" INTEGER 
) WITHOUT ROWID;

Storing delta values

I have never tried this in SQLite in practice, but: some data is more compactly stored by storing only the difference to the previous row. In this case, the datetime values are great for this, if they can be stored in order into the database. In that case, you would use three columns:

  1. A sequentially numbered integer primary key.
  2. A datetime integer column, only filled for every (say) 1000 records.
  3. A datetime_delta column, storing the difference from the last row with a filled datetime column.

SQLite stores integer values individually into 2-9 bytes (incl. one storage mode byte), depending on their value. Assuming an average of 4 bytes for the sequential primary key and an average of 2.5 bytes for the delta values would save 2.5 bytes compared to the current 9 byte integer. Savings are larger if you have more than one column per table that profit from delta storage, as you need the sequential primary key only once. For example, if value1value4 are measurement values that tend to grow and shrink over time, they are also good candidates.

After loading a row at primary key value n, you would also have to retrieve the row at primary key value n DIV 1000 (meaning integer division) to have the base for your delta values.