I would use a system where each payment is recorded separately.
Typically, you'd want the "balance" to stay consistent over time, and that requires recording each payment individually. Recording payments separately also gives you:
- the ability to link related payments
- the ability to record cancellations separately
I would never concatenate previous payments into one opaque string.
I also wouldn't worry about performance yet: that's premature optimisation. If you do expect hundreds of millions of rows, there are other techniques to improve performance. Otherwise, concentrate on data integrity and correctness.
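To make the idea concrete, here is a minimal sketch of an append-only payment ledger, assuming SQLite and invented table/column names (`payment`, `links_to`, etc.) purely for illustration — not any particular schema from the discussion above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE payment (
    id          INTEGER PRIMARY KEY,
    account_id  INTEGER NOT NULL,
    amount      INTEGER NOT NULL,  -- minor units (cents); never floats
    kind        TEXT NOT NULL CHECK (kind IN ('payment', 'cancellation')),
    links_to    INTEGER REFERENCES payment(id),  -- a cancellation links to the payment it voids
    created_at  TEXT NOT NULL DEFAULT (datetime('now'))
);
""")

# Each event is its own row; nothing is overwritten or concatenated.
conn.execute("INSERT INTO payment (account_id, amount, kind) VALUES (1, 5000, 'payment')")
conn.execute("INSERT INTO payment (account_id, amount, kind) VALUES (1, 2500, 'payment')")
# A cancellation is recorded separately and linked to the original payment.
conn.execute("INSERT INTO payment (account_id, amount, kind, links_to) "
             "VALUES (1, -2500, 'cancellation', 2)")

# The balance is always derivable from the individual rows.
balance = conn.execute(
    "SELECT SUM(amount) FROM payment WHERE account_id = 1"
).fetchone()[0]
print(balance)  # 5000
```

Because every row is immutable history, the balance can be recomputed (and audited) at any point in time, which is exactly what an opaque concatenated string cannot give you.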
My bias is to use a single table with appropriate row-level security.
There are potentially huge maintenance advantages to a single set of tables. If you end up with n copies of each table, that means you have to run n copies of each script every time you want to make a change. Frequently, that means you end up with at least a few very slightly different versions of the application running at a time, because someone forgot to apply script 7 of 23 in a monthly build to one set of tables, or someone created an index on one set of tables to address one customer's issue without adding it for every customer, which makes debugging much harder.
A single set of tables has significant scalability advantages. Adding new customers just requires adding a new row to the customer table, not deploying a new schema/database with the new customer's copies of the tables. Adding new customers also doesn't directly add ongoing work for the DBAs. If you have separate copies of the tables, you need someone to deploy tables to create a new customer, and every new customer means additional work for the DBA, at least to run the scripts one more time every time there is a change.
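A minimal sketch of the shared-schema approach, assuming SQLite and hypothetical table names (`customer`, `invoice`): every tenant-owned row carries a `customer_id`, so onboarding a customer is a single INSERT rather than a schema deployment:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE invoice (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(id),
    total       INTEGER NOT NULL
);
-- every query filters on the tenant column, so index it
CREATE INDEX invoice_customer ON invoice (customer_id);
""")

# Adding a new customer: one row, no new tables, no DBA script run.
conn.execute("INSERT INTO customer (name) VALUES ('Acme')")
conn.execute("INSERT INTO customer (name) VALUES ('Globex')")
conn.executemany("INSERT INTO invoice (customer_id, total) VALUES (?, ?)",
                 [(1, 100), (1, 250), (2, 999)])

def invoices_for(customer_id):
    # Every query must be scoped by tenant. In production you'd enforce
    # this with the database's row-level security rather than by convention.
    return conn.execute(
        "SELECT id, total FROM invoice WHERE customer_id = ?", (customer_id,)
    ).fetchall()

print(invoices_for(1))  # Acme's rows only
```

SQLite has no row-level security of its own, so the tenant filter here is by convention; engines such as PostgreSQL or SQL Server can enforce the same predicate server-side so that even a misbehaving query cannot cross tenants.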
A single set of tables may also offer performance advantages. If you're using a separate set of tables, each customer would realistically need a separate connection pool in the middle tier. It would defeat the purpose of having separate tables, after all, if your middle tier connects as a user that can see every tenant's data, because then you'd be implementing row-level security in the middle tier and dealing with all the complexity of multiple sets of tables in the back end. That makes it tough to scale across servers: do you create a connection pool for every client on every server? Do you send certain clients to certain servers? Do you not preallocate connections and incur the cost of waiting for connections to be established more frequently?
That being said, there are cases where separate tables might be preferred. If your customers are frequently large institutions, for example, separate tables will make it much easier to do things like move a customer to a dedicated box (or at least a dedicated VM) if the customer wants to upgrade to dedicated hardware so that they don't risk performance being affected by other customers. Those large institutions may want greater control over outages and upgrades so it may make sense to have separate tables to allow different customers to be upgraded at different times in order to work with that customer's schedule. Those institutions may find it easier to tell an auditor that their data is physically separate from all other customers rather than explaining that the data is physically intermingled but security controls are in place to guarantee row-level security. If you're going to have relatively few relatively large customers, the amount of maintenance overhead you introduce by having separate tables may not be particularly significant given the general day-to-day maintenance tasks that each client likely requires. In that sort of environment, different clients often have sufficiently different configurations that the problems they encounter are relatively unique even when they're running exactly the same version of the software.
Databases typically implement a lot of OS-style functionality themselves. At a high level, this is done to improve performance and scalability.
To give you some examples:
Operating system file caches are generic, least-frequently-used style caches. A database needs much finer control of what goes into the cache and what does not. For example, when you calculate hash tables for queries, you want to give them preference over file data while the query runs (it is cheaper to bring in some new file data than to page out hash tables). A database also needs to be in very careful control of the memory allocator to avoid fragmenting the heap. Most operating systems are simply not up to the task of handling the high-speed allocations that databases need, so databases typically implement their own memory manager, purpose-built for the database engine.
Operating system caches also don't guarantee persistence unless you call fsync on Linux or ask for unbuffered I/O on Windows. To guarantee consistency and the ACID properties, a database needs much finer control over when data is on disk and when it is merely cached. Because of this, there is very little benefit to using synchronous, buffered I/O.
Furthermore, most OS caches are relatively poorly implemented. The Windows file system cache only got really good NUMA-style scaling support around Windows 2008, and Linux still has very poor file system performance at high core counts. FreeBSD doesn't even know what NUMA is. Since databases need to run on very large servers (in some cases up to 32 CPU sockets), database vendors often implement their own caching data structures that are much more scalable than what the OS provides by default.
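The durability point can be sketched in a few lines, assuming POSIX semantics (the file path and record contents here are invented for illustration): a plain write may sit in the OS page cache indefinitely, and only fsync (or unbuffered I/O) forces it to stable storage, which is why a database acknowledges a commit only after that call returns:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "wal.log")
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)

record = b"COMMIT txn=42\n"
os.write(fd, record)  # may still live only in the OS page cache
os.fsync(fd)          # kernel must now push it to disk; a database reports
                      # the transaction committed only after this returns
os.close(fd)

with open(path, "rb") as f:
    print(f.read())   # b'COMMIT txn=42\n'
```

In a crash between the write and the fsync, the record can be lost even though the write call succeeded, which is exactly why a database cannot leave persistence decisions to the generic OS cache.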
And last but not least: databases are very hungry for I/O and can drive the file system much harder than a typical file server. Relying on traditional, synchronous I/O is simply too slow for most databases, and they instead use aggressive, core-affinitised, asynchronous I/O completion to gain scale. The I/O subsystem of a database tends to be much more advanced than what a standard OS provides.
In general, you will find much more advanced control of machine resources in database source code than in operating system kernels. In fact, most high-scale databases implement their own user-space kernel to avoid using the operating system's kernel primitives. If you see a database do a lot of kernel transitions, start looking around for a better product.