PostgreSQL – Should Math Be Done Within the Database for ACIDity?

Tags: architecture, postgresql, python, trigger

This is my first question on Database Administration Stack Exchange, so please be kind if something isn't done quite the right way.

I would like some advice on the way I'm working with my database, in order to make it as ACID as possible.

I am working on a web application using the Flask framework (Python) and a PostgreSQL (9.4) database, with SQLAlchemy as the ORM.

I am doing financial analysis on stocks handled by my hedge fund.
My work consists of four parts:

  • updating the database daily
  • doing mathematics on it
  • looking for the best results
  • presenting the results on the web app

Here is my database:

security table (the active column just indicates whether the stock should still be updated):

 id |   ticker       | active 
----+----------------+--------
  1 | 3993 HK Equity | t      
  2 | A2A IM Equity  | t      
  3 | AA UN Equity   | f      
  4 | AA/ LN Equity  | f      
  5 | AAL LN Equity  | t      
  6 | AALB NA Equity | t      

valuation table

 security_id |    date    | px_close |  px_open |  px_high |  px_low | volume 
-------------+------------+----------+----------+----------+---------+--------
          42 | 2012-11-06 |   44.757 |   45.099 |   45.336 |  43.918 | 157295 
          42 | 2012-11-07 |   45.099 |   44.791 |   45.099 |  44.226 | 162671 
          42 | 2012-11-08 |   44.245 |   45.051 |    45.08 |  44.241 |  56763 
          42 | 2012-11-09 |   43.994 |   44.307 |   44.672 |  43.771 | 123424 
          43 | 2012-11-06 |  143.994 |  144.255 |  144.601 | 143.785 |  54598 
          43 | 2012-11-07 |  143.885 |  143.629 |  144.108 | 143.524 |  39380 
          43 | 2012-11-08 |  144.392 |  144.056 |  144.928 | 143.866 |  60509 
          43 | 2012-11-09 |  144.008 |  144.563 |   144.61 | 143.913 |  62784 

regression table (I have a lot of other tables like this one, e.g. polynomial_regression, learning_clustering, multiple_regression, but this one will do as the example):

 security_id | driver_id | security_field | driver_field | lag |     percentile   |      percentage    |      1w_change     |        r2
-------------+-----------+----------------+--------------+-----+------------------+--------------------+--------------------+-------------------
       10260 |       622 | px_open        | px_low       |   0 | 2.39085239085239 |  0.233283156908672 |  -1.65479076267458 |  0.76137019793836
       10260 |       622 | px_open        | best_eps     |   0 | 25.8333333333333 |  0.070876900435926 |   4.62480252764613 | 0.593456515505473
       10260 |       622 | px_high        | px_close     |   0 | 2.18295218295218 |  0.211301222947739 |  -7.67181960128018 | 0.756561105159073
       10260 |       622 | px_high        | px_open      |   0 | 3.01455301455301 |  0.173091398167856 |  -2.06843453731421 | 0.755248026072118
       10260 |       622 | px_high        | px_high      |   0 |  2.5987525987526 |  0.189682461742673 |  -10.5755212601686 | 0.748422363395702
       10260 |       622 | px_high        | px_low       |   0 |  2.5987525987526 |  0.206663840524837 |  -3.72904823112292 | 0.758691421914131
       10260 |       622 | px_high        | best_eps     |   0 | 26.7857142857143 | 0.0642618785351929 |  0.600880162491535 | 0.586932757156747
       10260 |       622 | px_low         | px_close     |   0 | 2.18295218295218 |  0.241074941757577 |  -3.72991088758724 | 0.758602680862882
       10260 |       622 | px_low         | px_open      |   0 | 3.01455301455301 |  0.193588738372212 | -0.616152379637857 | 0.753488068588711
       10260 |       622 | px_low         | px_high      |   0 | 2.80665280665281 |   0.21139288442443 |  -4.97343018089906 | 0.743185969928886

Most of the operations on the database are performed via the Flask app, but not all of them. That's why I would like my database to be as 'strong' as possible (I mean ACID here). What's more, I obviously trust Postgres more than Flask.

My database is updated every day at 3 AM. The previous day's new values are inserted, and newly active stocks are updated from the beginning of time (along with certain other stocks, e.g. the ones just after dividends, etc.).

Then, two hours later, once the update is done, a Flask script computes all the new regressions pairwise and inserts the results.
The problem is that between the update of the values and the update of the regressions, the database is not in a 'stable' state: new values have been inserted, but the corresponding regressions don't exist yet.

Do you think I should translate all my mathematical analysis from Python to SQL, and create triggers that launch it whenever a new value is inserted?

  • My problem is that I don't know how fast SQL is at stats and maths compared to other languages. I know that basic functions (correlation, R², …) are already implemented in SQL, but I am using far more 'advanced' (I mean 'complex') functions, even if I have not shown them here.
  • If you think this is not a good idea, do you have other tips I could use to enforce the 'well-behaved' state of my database, despite using external tools such as the Python app?

Thank you very much for your advice,
Edouard

Best Answer

Welcome to DBA Stack Exchange!

My problem is that I don't know how fast SQL is at stats and maths compared to other languages. I know that basic functions (correlation, R², …) are already implemented in SQL, but I am using far more 'advanced' (I mean 'complex') functions, even if I have not shown them here.

As a rule of thumb, for basic aggregations (grouping, joining, summation) on large datasets (millions of rows), SQL will perform better. I'd recommend you leverage SQL's strengths here and use a hybrid approach: prep your data with basic aggregations as much as possible, but leave the higher-level mathematics and analysis logic in Python.
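Here is a minimal sketch of that hybrid split, assuming the valuation table from your question and a placeholder connection string; pandas and SciPy stand in for whatever analysis stack you already use:

```python
import pandas as pd
from scipy import stats
from sqlalchemy import create_engine, text

# Placeholder DSN -- adjust to your environment.
engine = create_engine("postgresql://user:pass@localhost/hedgefund")

# SQL prep: align the security and its driver by date inside the database,
# so Python only receives the rows it actually needs.
query = text("""
    SELECT s.date,
           s.px_close AS sec_close,
           d.px_close AS drv_close
    FROM valuation s
    JOIN valuation d USING (date)
    WHERE s.security_id = :sec
      AND d.security_id = :drv
    ORDER BY s.date
""")
df = pd.read_sql(query, engine, params={"sec": 42, "drv": 43})

# Python analysis: anything beyond basic aggregates stays here.
fit = stats.linregress(df["drv_close"], df["sec_close"])
print(f"r2 = {fit.rvalue ** 2:.6f}")
```

The database does the set-based work it's good at (joining and filtering millions of rows), and Python only ever touches the small aligned result set.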

Do you think I should translate all my mathematical analysis from Python to SQL, and create triggers that launch it whenever a new value is inserted?

No.

As stated before, you can do basic aggregate functions in SQL, but I expect you will find it tedious, if not outright impossible, to translate the more 'complex' statistical functions from a feature-rich language such as Python into a constrained language such as SQL. Here is a decent article exploring this concept. Even if we succeed in translating Python functions into SQL, our success will be highly dependent on whether the database is optimized to deliver a set-based representation of all the data inputs needed for those calculations at insert time.
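For reference, the basics really are built in: PostgreSQL ships statistical aggregates such as corr(), regr_slope() and regr_r2(), so this much can stay in SQL with no trigger or procedural code at all. A sketch against your valuation table (placeholder DSN again):

```python
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@localhost/hedgefund")  # placeholder DSN

# corr(), regr_slope() and regr_r2() are built-in PostgreSQL aggregates;
# this level of statistics needs no Python at all.
sql = text("""
    SELECT corr(s.px_close, d.px_close)       AS correlation,
           regr_slope(s.px_close, d.px_close) AS slope,
           regr_r2(s.px_close, d.px_close)    AS r2
    FROM valuation s
    JOIN valuation d USING (date)
    WHERE s.security_id = :sec
      AND d.security_id = :drv
""")

with engine.connect() as conn:
    row = conn.execute(sql, {"sec": 42, "drv": 43}).one()
    print(row.correlation, row.slope, row.r2)
```

Anything much past this (your percentile and lagged-change columns, say) is where the translation effort starts to outweigh the benefit.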

The problem is that between the update of the values and the update of the regressions, the database is not in a 'stable' state: new values have been inserted, but the corresponding regressions don't exist yet.

Let's be sure we aren't dogmatically promoting database ACIDity at the potential cost of performance, and the assured cost of increased code complexity. I DO understand the importance of ACIDity, but from your description this sounds like an analytic workflow to me: a 3 AM bulk load of data, which is subsequently analyzed outside of SQL, with the results then inserted back into the database.
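If the real worry is readers seeing new valuations without their regressions, one trigger-free option is to commit each day's regression rows together with a 'done' flag in a single transaction. A sketch only: the update_log table, its regressions_done column, and compute_regressions() are all hypothetical, not part of your schema:

```python
from datetime import date
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@localhost/hedgefund")  # placeholder DSN

def publish_regressions(batch_date: date) -> None:
    # Hypothetical: your existing Python math step, returning a list of
    # dicts keyed like the bind parameters below.
    rows = compute_regressions(batch_date)

    with engine.begin() as conn:  # one transaction: all rows appear at once, or not at all
        conn.execute(
            text("""
                INSERT INTO regression
                    (security_id, driver_id, security_field, driver_field,
                     lag, percentile, percentage, "1w_change", r2)
                VALUES
                    (:security_id, :driver_id, :security_field, :driver_field,
                     :lag, :percentile, :percentage, :one_week_change, :r2)
            """),
            rows,
        )
        # Flip the flag last, inside the same transaction, so readers only
        # ever see 'done' once every regression row is already visible.
        conn.execute(
            text("UPDATE update_log SET regressions_done = true WHERE day = :d"),
            {"d": batch_date},
        )
```

Anything that reads the regressions can check the flag first, so the two-hour gap stops mattering without moving a single formula into SQL.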

To summarize: yes, SQL will perform better on basic aggregations across large datasets, but higher-level statistical analysis is best left to Python. I'm not sure it's worth the risk of translating Python functions into SQL triggers for the sake of ACIDity.

P.S. I know this is a late answer to your question, so let us know what you eventually decided to implement!