Postgresql – Saving images as base64 encoded strings, why is it bad

encodingmongodbperformancepostgresql

I've seen this on one of the production databases I've come across and these images apparently cover a large portion of their DB. After researching a lot I couldn't really find a lot of good answers pertaining as to why it is actually bad in terms of performance.

Yes, it is not stream friendly and takes up ~33% more space: https://stackoverflow.com/questions/9722603/storing-image-in-database-directly-or-as-base64-data

Sure, it gives wrong hints to DB: https://stackoverflow.com/questions/9722603/storing-image-in-database-directly-or-as-base64-data

And other answers like: storing files on mongodb don't really explain why this is problem

But, what is actually going on that might cause databases that store images as base64 encoded string to lose performance? Why would a DB lose performance on selects(or any other aspect) due to such practices?

I would also like an answer that specifies a NoSQL(MongoDB) vs RDBMS(PostgreSQL) performance in this regard. Does the nature of having things in memory like MongoDB make the performance hit even worse? And if so, why? How is this image saved as a base64 string behaving on both systems?

Best Answer

The one reason I can see is that you are doing unnecessary work that way. PostgreSQL will compress and toast the base64 string, so you have to pay the price of compression and decompression, unless you set the column to EXTERNAL, then you don't compress, but you waste storage space and I/O bandwidth.

Hint: if you store compressed binary data in PostgreSQL, set the bytea column to EXTERNAL storage. Otherwise you waste CPU in a futile attempt to compress the data.

Related Solutions

Postgresql – Using MongoDB and PostgreSQL together

Some thoughts....

Typically one does not want to store pieces of tightly interrelated information in different systems. The chances of things getting out of sync is significant and now instead of one problem on your hands you have two. One thing you can do with Mongo though is use it to pipeline your data in or data out. My preference is to keep everything in PostgreSQL to the extent this is possible. However, I would note that doing so really requires expert knowledge of PostgreSQL programming and is not for shops unwilling to dedicate to using advanced features. I see a somewhat different set of options than you do. Since my preference is not something I see listed I will give it to you.

You can probably separate your metadata into common data, data required for classes, and document data. In this regard you would have a general catalog table with the basic common information plus one table per class. In this table you would have an hstore, json, or xml field which would store the rest of the data along with columns where you are storing data that must be constrained significantly. This would reduce what you need to put in these tables per class, but would allow you to leverage constraints however you like. The three options have different issues and are worth considering separately:

hstore is relatively limited but also used by a lot of people. It isn't extremely new but it only is a key/value store, and is incapable of nested data structures, unlike json and xml.

json is quite new and doesn't really do a lot right now. This doesn't mean you can't do a lot with it, but you aren't going to do a lot out of the box. If you do you can expect to do a significant amount of programming, probably in plv8js or, if you want to stick with older environments, plperlu or plpython. json is better supported in 9.3 though at least in current development snapshots, so when that version is released things will get better.

xml is the best supported of the three, with the most features, and the longest support history. Then again, it is XML.....

However if you do decide to go with Mongo and PostgreSQL together, note that PostgreSQL supports 2 phase commit meaning you can run the write operations, then issue PREPARE TRANSACTION and if this succeeds do your atomic writes in Mongo. If that succeeds you can then COMMIT in PostgreSQL.

Postgresql – Compound Index on 43 Million PostgreSQL table

In a GiST index, the order of columns has a different significance than in a B-tree index. Per documentation:

A multicolumn GiST index can be used with query conditions that involve any subset of the index's columns. Conditions on additional columns restrict the entries returned by the index, but the condition on the first column is the most important one for determining how much of the index needs to be scanned. A GiST index will be relatively ineffective if its first column has only a few distinct values, even if there are many distinct values in additional columns.

In short: put the most selective columns first.

Your EXPLAIN output shows that the condition on pid is more selective (rows=7836) than the one on outline (rows=63112). If that can be generalized (a single example may be misleading) I suggest this alternative:

CREATE INDEX inventory_compound_idx ON portal.inventory USING gist (pid, outline);

If most of your (important) queries include conditions on both columns, a multicolumn index may serve you well. Else, single columns may be better overall.

Table layout

This is an educated guess since I don't have complete information.

Don't use oid as column name. It's easy to confuse with the OID.
Don't use the name date for a timestamp column. Or rather: don't use the name date for any column, don't use names of base-types for identifiers at all. Can lead to confusing mistakes and error messages.
Create a lookup table for types and only put a small integer type_id into the big table. Pack fixed-length types tightly so not to waste space to padding. Details.
I prefer the type text (or varchar without length limit) over varchar(n). Details.

For example:

CREATE TABLE portal.inventory (
   inventory_id bigint PRIMARY KEY
  ,type_id      integer NOT NULL REFERENCES inv_type(type_id)
  ,pid          integer NOT NULL
  ,size         bigint NOT NULL
  ,ts           timestamp NOT NULL
  ,outline      geography(Polygon,4326)
  ,product_name text
  ,path         text
);

Best Answer

Related Solutions

Postgresql – Using MongoDB and PostgreSQL together

Postgresql – Compound Index on 43 Million PostgreSQL table

Table layout

Related Question