Why does there have to be a tradeoff between optimizing a DB for transactional data and optimizing a DB for analytical data?

olap, optimization, schema, time-series-database

I'm a data scientist who's more on the math and modeling side of things than the engineering side, and I don't know much about the logic, data structures and design principles that underlie database design, beyond various platitudes like "Use a relational DB for structured data and a NoSQL DB for unstructured data", etc…

I often hear that you have to choose whether a DB is going to be used for transactional data that gets frequent updates and requires low latency, or whether it is destined to be more of an analytical database that gets occasional large batch updates and where latency is not a constraint. The implication is usually that there is some tradeoff here, and that you have to decide between one or the other.

What I don't understand is why this either/or tradeoff is inevitable. I've worked with OLAP cubes (true OLAP platforms, not ROLAP) that could handle near-realtime updates and commits to the main data cube from users, as well as batch processing for some of the larger modeling tasks, and that could also do realtime slicing and dicing, plus on-the-fly aggregation/disaggregation of various metrics based on predefined aggregation rules and hierarchies.

If such a system is possible (the platforms I worked with have been around since the mid-2000s), why is a backend that can handle both transactional data and analytical data such a difficult objective to attain?

Best Answer

Because there is a limited amount of resources (CPU, memory, disk) and you want to use them as efficiently as possible.

For OLTP you typically have many small transactions, and concurrency is important. Your data model is usually normalized and most data access is done via index lookups. A small subset of your data is used over and over, so you want that subset cached in memory.
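
As a minimal sketch (hypothetical table and column names), an OLTP workload tends to look like a normalized schema hit by many short transactions doing indexed, single-row reads and writes:

```sql
-- Hypothetical normalized OLTP tables: each order row references a customer row.
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);

CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers (customer_id),
    status      TEXT NOT NULL,
    created_at  TIMESTAMP NOT NULL
);

-- Index to support the common point lookup below.
CREATE INDEX idx_orders_customer ON orders (customer_id);

-- Typical OLTP work: a short transaction touching a handful of rows via an index.
BEGIN;
INSERT INTO orders (order_id, customer_id, status, created_at)
VALUES (1001, 42, 'placed', CURRENT_TIMESTAMP);

SELECT order_id, status
FROM orders
WHERE customer_id = 42;   -- index lookup, not a full scan
COMMIT;
```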

For OLAP/DW you typically have fewer transactions, but in return they are larger. Writes are usually done via some ETL process. Your model is often de-normalized and in some sense prepared for your OLAP queries. Keeping all of the data you use in memory is often not possible, so it has to be read from disk; sorting is often involved, so the working memory used for sorts matters more here.
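
By contrast, a sketch of an OLAP/DW shape (again with hypothetical names) is a denormalized fact table loaded in bulk and queried with large scans, grouping and sorting:

```sql
-- Hypothetical denormalized fact table, loaded in bulk by an ETL job.
CREATE TABLE sales_fact (
    sale_date    DATE NOT NULL,
    region       TEXT NOT NULL,
    product_name TEXT NOT NULL,    -- denormalized: stored directly, no join needed
    quantity     INTEGER NOT NULL,
    revenue      NUMERIC(12, 2) NOT NULL
);

-- Typical OLAP query: scan a large slice of the table, then group, aggregate, sort.
SELECT region,
       product_name,
       SUM(revenue)  AS total_revenue,
       SUM(quantity) AS total_quantity
FROM sales_fact
WHERE sale_date BETWEEN DATE '2023-01-01' AND DATE '2023-12-31'
GROUP BY region, product_name
ORDER BY total_revenue DESC;
```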

Realtime updates to cubes are not a problem as long as a limited number of users are doing the updating. Throw in millions of users from the OLTP scenario and concurrency comes into play.

That said, most systems I've seen contain a little bit of both worlds. You are likely to see some compromises in both the models and the physical implementations of those models.

When it comes to SQL vs NoSQL, the trend in recent years has been that they are approaching each other. NoSQL has shifted from meaning "No SQL" to "Not Only SQL". Most SQL databases, on the other hand, now support semi-structured data like XML and JSON that can be queried efficiently from within SQL.
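
As one concrete illustration (PostgreSQL syntax shown; other engines expose similar functionality through functions such as JSON_VALUE, and the table and keys here are hypothetical), a relational table can store a JSON document alongside ordinary columns and query into it with plain SQL:

```sql
-- Hypothetical table mixing a relational key with a semi-structured JSON payload.
CREATE TABLE events (
    event_id INTEGER PRIMARY KEY,
    payload  JSONB NOT NULL
);

INSERT INTO events (event_id, payload)
VALUES (1, '{"type": "click", "user": {"id": 42, "plan": "pro"}}');

-- JSON fields can be filtered and projected directly from SQL.
SELECT event_id,
       payload ->> 'type'           AS event_type,
       payload -> 'user' ->> 'plan' AS user_plan
FROM events
WHERE payload ->> 'type' = 'click';
```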