PostgreSQL vs Application Server Database – Performance for Concurrent and Bulk Data

Tags: concurrency, performance, postgresql, sqlite

I asked a question about Ripple's database implementation, and received this response:

The ripple server uses SQLite for structured data and a configurable "back end" for unstructured "bulk" storage.

The structured data consists of things like transactions indexed by which accounts they affected. The unstructured data consists of "chunks" of data indexed by hash that constitute portions of network history.

The preferred back end for bulk storage is currently RocksDB on Linux platforms.
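
To make that split concrete, here is a rough sketch of what the two kinds of storage might look like. This is purely illustrative, not rippled's actual schema; the table and column names are invented.

```python
import sqlite3

# Illustrative only: invented table/column names, not rippled's real schema.
conn = sqlite3.connect("example.db")

# "Structured" side: transactions indexed by the accounts they affected,
# so they can be queried relationally.
conn.execute("""
    CREATE TABLE IF NOT EXISTS account_transactions (
        account    TEXT NOT NULL,
        tx_hash    TEXT NOT NULL,
        ledger_seq INTEGER NOT NULL,
        PRIMARY KEY (account, tx_hash)
    )
""")

# "Bulk" side: opaque chunks of history, looked up only by hash.
# A key/value store such as RocksDB fits this shape naturally.
conn.execute("""
    CREATE TABLE IF NOT EXISTS history_chunks (
        chunk_hash TEXT PRIMARY KEY,
        payload    BLOB NOT NULL
    )
""")
conn.commit()
```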

This strikes me as strange since Ripple's structure allows the developers to place almost any demand they wish upon the server operators. In other words, why not use a database server, specifically PostgreSQL?

I found this interesting breakdown of PostgreSQL vs SQLite, and this explanation:

It breaks down to how they implement snapshot isolation.

SQLite uses file locking as a means to isolate transactions, allowing writes to proceed only once all reads are done.

Postgres, in contrast, uses a more sophisticated approach called multiversion concurrency control (MVCC), which allows multiple writes to occur in parallel with multiple reads.
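
The SQLite side of that claim is easy to observe. Below is a minimal sketch, assuming the default rollback-journal mode (WAL mode, available since SQLite 3.7, relaxes this): a writer can stage changes, but it cannot commit while another connection holds an open read transaction.

```python
import sqlite3

# Set up a table (default rollback-journal mode assumed).
setup = sqlite3.connect("lock_demo.db")
setup.execute("CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, v TEXT)")
setup.commit()
setup.close()

# Reader opens a transaction and takes a SHARED lock with its first SELECT.
reader = sqlite3.connect("lock_demo.db", timeout=0, isolation_level=None)
reader.execute("BEGIN")
reader.execute("SELECT COUNT(*) FROM t").fetchone()

# Writer can stage changes under a RESERVED lock, but committing needs an
# EXCLUSIVE lock, which it cannot get while the read is still open.
writer = sqlite3.connect("lock_demo.db", timeout=0, isolation_level=None)
writer.execute("BEGIN IMMEDIATE")
writer.execute("INSERT INTO t (v) VALUES ('x')")
try:
    writer.execute("COMMIT")
except sqlite3.OperationalError as e:
    print("writer blocked until reads finish:", e)   # "database is locked"

reader.execute("COMMIT")   # release the SHARED lock
writer.execute("COMMIT")   # now the write goes through
```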

First, is it true that the ideal implementation for bulk storage is to use a flat file database?

Second, is it true that for concurrent reads & writes, PostgreSQL vastly outperforms a flat file database?

Lastly, when tables approach billions of rows, which is superior for concurrent performance: a flat file database or PostgreSQL?

Please assume both alternatives are ideally tuned.

Best Answer

First, is it true that the ideal implementation for bulk storage is to use a flat file database?

If you're talking about SQLite, it isn't a "flat file database". Sure, it's a single file database, but it's highly structured.

Second, is it true that for concurrent reads & writes, PostgreSQL vastly outperforms a flat file database?

PostgreSQL will typically vastly outperform SQLite for workloads that have simultaneous reads and writes, because readers can proceed while the table is being written to. For multiple concurrent readers and writers with non-trivial transactions there's no contest. To make this possible, PostgreSQL must write each change at least twice (once to the write-ahead log and again to the table itself) and then do cleanup "vacuum" work later.
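
As a rough illustration of readers proceeding alongside a writer, here is a sketch using the psycopg2 driver. The connection string is an assumption; it presumes a reachable PostgreSQL server with a demo database and user.

```python
import psycopg2

DSN = "dbname=demo user=demo"   # hypothetical connection parameters

writer = psycopg2.connect(DSN)
reader = psycopg2.connect(DSN)

with writer.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS t (id int PRIMARY KEY, v text)")
    cur.execute("INSERT INTO t VALUES (1, 'old') ON CONFLICT (id) DO NOTHING")
    cur.execute("UPDATE t SET v = 'old' WHERE id = 1")   # reset for repeat runs
writer.commit()

# The reader's REPEATABLE READ transaction works against one snapshot.
reader.set_session(isolation_level="REPEATABLE READ")
rcur = reader.cursor()
rcur.execute("SELECT v FROM t WHERE id = 1")
print("reader sees:", rcur.fetchone()[0])            # 'old'

# The writer updates and commits without waiting for the open reader.
with writer.cursor() as cur:
    cur.execute("UPDATE t SET v = 'new' WHERE id = 1")
writer.commit()

# MVCC keeps the reader's snapshot intact until its transaction ends.
rcur.execute("SELECT v FROM t WHERE id = 1")
print("reader still sees:", rcur.fetchone()[0])      # still 'old'
reader.commit()

rcur.execute("SELECT v FROM t WHERE id = 1")
print("reader now sees:", rcur.fetchone()[0])        # 'new'
```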

Often you can serialize many short read and write transactions, so they appear to execute very quickly even though they each wait a while on a lock. So for trivial reads and writes SQLite may still do very well even with several concurrent clients, so long as the lock durations are short. For long transactions it's totally hopeless.
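
A minimal sketch of that pattern: give each connection a busy timeout so short writers simply queue behind one another instead of failing. The file name is just an example.

```python
import sqlite3
import threading

DB = "serialized.db"   # example scratch database

def short_write(n: int) -> None:
    # One connection per thread; the timeout acts as a busy handler, so each
    # short write waits briefly for the lock rather than erroring out.
    conn = sqlite3.connect(DB, timeout=5.0)
    conn.execute("INSERT INTO events (worker) VALUES (?)", (n,))
    conn.commit()
    conn.close()

setup = sqlite3.connect(DB)
setup.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, worker INTEGER)")
setup.commit()
setup.close()

threads = [threading.Thread(target=short_write, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

check = sqlite3.connect(DB)
print(check.execute("SELECT COUNT(*) FROM events").fetchone()[0])   # 8
```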

In other words, why not use a database server, specifically PostgreSQL?

Probably because it's trivial to set up and run SQLite, so it's way easier to embed in an application.

PostgreSQL doesn't support in-process embedding. While it's easy enough to bundle the database server, it's still a hassle many devs don't want. Upgrading PostgreSQL is also a pain, as the on-disk format changes with every major version and requires an upgrade step, whereas SQLite's file format is remarkably forward- and backward-compatible.
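
For contrast, the entire "setup" for an embedded SQLite database can be a single call; the snippet below uses Python's bundled sqlite3 module.

```python
import sqlite3

# The whole database engine runs in-process: no server, no configuration.
conn = sqlite3.connect(":memory:")        # or a file path for persistence
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v BLOB)")
conn.execute("INSERT INTO kv VALUES (?, ?)", ("example", b"\x00" * 32))
print(conn.execute("SELECT count(*) FROM kv").fetchone()[0])   # 1
```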