Postgresql – Store large compressed documents in Postgresql

compressionperformancepostgresqlpostgresql-performance

Where I work we keep approximately 500,000 image files related to processing for our customer in a file system attached to our Redhat 5.11 production server. We are planning an upgrade to Redhat 7.4 and have already done this for our development database servers. The number of documents is increasing and while we do need to archive older documents, we are talking about moving storage of the documents into a database where we have access to auditing and database level security. If we store the objects in Oracle, they won't be compressed unless we pay for the advanced compression option. We would like top avoid that cost.

Does Postgresql have similar compression for large binary objects? Does it make sense to try using Postgresql for this? I would use Database Heterogeneous Connectivity from Oracle to read and write the large objects. Are there other tools that people have used to store large objects in databases? Thanks

Best Answer

Postgres compresses automatically large objects chunks, as well as bytea through the TOAST mechanism.

But most PDFs are already compressed. As an example I have 1000 dedup-ed pdfs in my mailbox stored as large objets totalling 281MB, and the compression ratio brought by postgres on these is only 11%.

For large objects you may query the compressed size, as a superuser, with

SELECT sum(pg_column_size(data)-4) FROM pg_largeobject WHERE loid = :id

Compression of BYTEA might be a bit better because it compresses the entire contents as a whole (limited to 1GB), whereas large objects are compressed as individual chunks of 2048 bytes or less (in fact, pagesize/4)

The compression ratio of bytea contents can be measured by comparing octet_length(column) with pg_column_size(column)-4

Does it make sense to try using Postgresql for this?

You should probably import a few hundred of your files in Postgres and see how they compress. For large objects it's a simple as \lo_import filename in psql.

Related Solutions

Mysql – Which Database to choose: MySQL or Oracle

Which db to choose? I would choose PostgreSQL having worked with MySQL and PostgreSQL but it is worth noting how different the databases are. I will say I have frequently been impressed (and only rarely disappointed) by what sort of abuse I can throw at PostgreSQL only to watch things be handled gracefully.

In your specific case, however, there may be reasons to look specifically at PostgreSQL. The major issue is programmability and advanced indexing but there are other reasons too (basically in InnoDB your table is contained in the pkey index, while in PostgreSQL you can do a physical-order scan of the table (and this has advantages when pulling a lot of rows). Major disadvantages might include not liking the binary type interfaces (and often saving/retrieving the values can use more memory than I would like). As for filesystem I know a lot of people who are very happy with XFS and also with ZFS.

Finally one specific tradeoff here is the Pg 8k data page size. It sounds like your time series would be larger than this, and so would be TOASTed (moved to extended storage). This speeds up many queries, but it also slows down retrieving the specific time series slightly because extra physical files must be scanned.

As for programmability, basically you could write functions that can look inside a time series for specific useful information and then you can index the output of this function. Consequently this can be used to speed up searches because the function doesn't have to be run.

SQL Server 2012 – Improve IO Request Time and Reduce Write Latency

Question 1: Though I will do in non peek hours of business but are there any chances that making these changes can corrupt my database or log file?

No this cannot damage your Log file.

Question 2: Does database compression mode can effect IO operation? If yes, how can I resolve it?

Yes database compression affects I/O and as far as my experience goes it decreases the I/O and is actually benefitial if you look at I/O consumption. I will tell you how.

Compressed data will take fewer pages when stored on disk hence when data is read fewer I/O is required to fetch data in memory.
Again since fewer pages are bought into memory this increases buffer pool availibility.
CPU is what consumed when compression or decompression happens. But in most cases you would find that benefits gained by space and I/O would outweight the CPU consumption
Please remember a compressed page will remain compressed on disk as well as compressed when brought in memory for reading or writing. So no additional work of decompressing in memory

This is excellent article on data compression, its big, but would help you in understanding data compression

You must remove all SQL Server related folder and files from Antivirus check specially if you have McAfee Antivirus.

•I am planning to change target recovery time to < 1 minute of data base (specifically tempdb and my database)

I can only say please leave it to default, I am not sure how recovery time has anything to do with log file and its configuration

Regarding Tempdb make sure you have Data files equal to number of physical cores. Paul has more to say about how many files you can have as per cores IMHO you can start with 4 tempdb files but keep monitoring for contention. Make sure that all tempdb data files have SAME INITIAL SIZE and SAME AUTOGROWTH setting. You can also enable TF 1118 to avoid contention

To check contention you can run below query

select 
session_id, 
wait_duration_ms,   
resource_description
from    sys.dm_os_waiting_tasks 
where   wait_type like 'PAGE%LATCH_%' and 
resource_description like '2:%'

There are some good article for you to read about Tempdb and Contention

Edit:

SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file

This message means that a particular session or query requested for I/O to get data from disk but that session had to wait more than 15 sec and after that request was catered. You can guess that 15 sec is threshold value moment this time is crossed message is dumped in errorlog. This MOSTLY means that disk is not able to cope up with I/O request which is getting generated and in turn means disk might be slow. Since you said 90 % of your errorlog is filled with this information I am forced to believe that underlying hardware is slow or might require a firmware upgrade. This Article will help you in understanding the issue and fixing it

Best Answer

Related Solutions

Mysql – Which Database to choose: MySQL or Oracle

SQL Server 2012 – Improve IO Request Time and Reduce Write Latency

Related Question