PostgreSQL vs Application Server Database – Performance for Concurrent and Bulk Data

Tags: concurrency, performance, postgresql, sqlite

I asked a question about Ripple's database implementation, and received this response:

The ripple server uses SQLite for structured data and a configurable "back end" for unstructured "bulk" storage.

The structured data consists of things like transactions indexed by which accounts they affected. The unstructured data consists of "chunks" of data indexed by hash that constitute portions of network history.

The preferred back end for bulk storage is currently RocksDB on Linux platforms.
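
To make that split concrete, here is a rough sketch of what the two kinds of storage might look like. This is purely illustrative, not rippled's actual schema; the table and column names are invented.

```python
import sqlite3

# Illustrative only: invented table/column names, not rippled's real schema.
conn = sqlite3.connect("example.db")

# "Structured" side: transactions indexed by the accounts they affected,
# so they can be queried relationally.
conn.execute("""
    CREATE TABLE IF NOT EXISTS account_transactions (
        account    TEXT NOT NULL,
        tx_hash    TEXT NOT NULL,
        ledger_seq INTEGER NOT NULL,
        PRIMARY KEY (account, tx_hash)
    )
""")

# "Bulk" side: opaque chunks of history, looked up only by hash.
# A key/value store such as RocksDB fits this shape naturally.
conn.execute("""
    CREATE TABLE IF NOT EXISTS history_chunks (
        chunk_hash TEXT PRIMARY KEY,
        payload    BLOB NOT NULL
    )
""")
conn.commit()
```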

This strikes me as strange since Ripple's structure allows the developers to place almost any demand they wish upon the server operators. In other words, why not use a database server, specifically PostgreSQL?

I found this interesting breakdown of PostgreSQL vs SQLite, and this explanation:

It breaks down to how they implement snapshot isolation.

SQLite uses file locking as a means to isolate transactions, allowing writes to proceed only once all reads are done.

Postgres, in contrast, uses a more sophisticated approach called multiversion concurrency control (MVCC), which allows multiple writes to occur in parallel with multiple reads.
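
The SQLite side of that claim is easy to observe. Below is a minimal sketch, assuming the default rollback-journal mode (WAL mode, available since SQLite 3.7, relaxes this): a writer can stage changes, but it cannot commit while another connection holds an open read transaction.

```python
import sqlite3

# Set up a table (default rollback-journal mode assumed).
setup = sqlite3.connect("lock_demo.db")
setup.execute("CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, v TEXT)")
setup.commit()
setup.close()

# Reader opens a transaction and takes a SHARED lock with its first SELECT.
reader = sqlite3.connect("lock_demo.db", timeout=0, isolation_level=None)
reader.execute("BEGIN")
reader.execute("SELECT COUNT(*) FROM t").fetchone()

# Writer can stage changes under a RESERVED lock, but committing needs an
# EXCLUSIVE lock, which it cannot get while the read is still open.
writer = sqlite3.connect("lock_demo.db", timeout=0, isolation_level=None)
writer.execute("BEGIN IMMEDIATE")
writer.execute("INSERT INTO t (v) VALUES ('x')")
try:
    writer.execute("COMMIT")
except sqlite3.OperationalError as e:
    print("writer blocked until reads finish:", e)   # "database is locked"

reader.execute("COMMIT")   # release the SHARED lock
writer.execute("COMMIT")   # now the write goes through
```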

First, is it true that the ideal implementation for bulk storage is to use a flat file database?

Second, is it true that for concurrent reads & writes, PostgreSQL vastly outperforms a flat file database?

Lastly, when tables approach billions of rows, which is superior for concurrent performance: a flat file database or PostgreSQL?

Please assume both alternatives are ideally tuned.

Best Answer

First, is it true that the ideal implementation for bulk storage is to use a flat file database?

If you're talking about SQLite, it isn't a "flat file database". Sure, it's a single file database, but it's highly structured.

Second, is it true that for concurrent reads & writes, PostgreSQL vastly outperforms a flat file database?

PostgreSQL will typically vastly outperform SQLite for workloads that have simultaneous reads and writes, because readers can proceed while the table is being written to. For multiple concurrent readers and writers with non-trivial transactions there's no contest. To make this possible, PostgreSQL must write each change at least twice (once to the write-ahead log and again to the table itself) and then do cleanup "vacuum" work later.
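
As a rough illustration of readers proceeding alongside a writer, here is a sketch using the psycopg2 driver. The connection string is an assumption; it presumes a reachable PostgreSQL server with a demo database and user.

```python
import psycopg2

DSN = "dbname=demo user=demo"   # hypothetical connection parameters

writer = psycopg2.connect(DSN)
reader = psycopg2.connect(DSN)

with writer.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS t (id int PRIMARY KEY, v text)")
    cur.execute("INSERT INTO t VALUES (1, 'old') ON CONFLICT (id) DO NOTHING")
    cur.execute("UPDATE t SET v = 'old' WHERE id = 1")   # reset for repeat runs
writer.commit()

# The reader's REPEATABLE READ transaction works against one snapshot.
reader.set_session(isolation_level="REPEATABLE READ")
rcur = reader.cursor()
rcur.execute("SELECT v FROM t WHERE id = 1")
print("reader sees:", rcur.fetchone()[0])            # 'old'

# The writer updates and commits without waiting for the open reader.
with writer.cursor() as cur:
    cur.execute("UPDATE t SET v = 'new' WHERE id = 1")
writer.commit()

# MVCC keeps the reader's snapshot intact until its transaction ends.
rcur.execute("SELECT v FROM t WHERE id = 1")
print("reader still sees:", rcur.fetchone()[0])      # still 'old'
reader.commit()

rcur.execute("SELECT v FROM t WHERE id = 1")
print("reader now sees:", rcur.fetchone()[0])        # 'new'
```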

Often you can serialize many short read and write transactions, so they appear to execute very quickly even though they each wait a while on a lock. So for trivial reads and writes SQLite may still do very well even with several concurrent clients, so long as the lock durations are short. For long transactions it's totally hopeless.
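
A minimal sketch of that pattern: give each connection a busy timeout so short writers simply queue behind one another instead of failing. The file name is just an example.

```python
import sqlite3
import threading

DB = "serialized.db"   # example scratch database

def short_write(n: int) -> None:
    # One connection per thread; the timeout acts as a busy handler, so each
    # short write waits briefly for the lock rather than erroring out.
    conn = sqlite3.connect(DB, timeout=5.0)
    conn.execute("INSERT INTO events (worker) VALUES (?)", (n,))
    conn.commit()
    conn.close()

setup = sqlite3.connect(DB)
setup.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, worker INTEGER)")
setup.commit()
setup.close()

threads = [threading.Thread(target=short_write, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

check = sqlite3.connect(DB)
print(check.execute("SELECT COUNT(*) FROM events").fetchone()[0])   # 8
```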

In other words, why not use a database server, specifically PostgreSQL?

Probably because it's trivial to set up and run SQLite, so it's way easier to embed in an application.

PostgreSQL doesn't support in-process embedding. While it's easy enough to bundle the database server, it's still a hassle many devs don't want. Upgrading PostgreSQL is also a pain, as the on-disk format changes with every major version and requires an upgrade step, whereas SQLite's file format is remarkably forward- and backward-compatible.
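
For contrast, the entire "setup" for an embedded SQLite database can be a single call; the snippet below uses Python's bundled sqlite3 module.

```python
import sqlite3

# The whole database engine runs in-process: no server, no configuration.
conn = sqlite3.connect(":memory:")        # or a file path for persistence
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v BLOB)")
conn.execute("INSERT INTO kv VALUES (?, ?)", ("example", b"\x00" * 32))
print(conn.execute("SELECT count(*) FROM kv").fetchone()[0])   # 1
```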