Graph Databases – Why CRUD Operations Are Inefficient


I heard that graph DBs are inherently worse at CRUD operations than relational DBs. Is this true? If yes, why is that?

Some thoughts:

To do a CRUD operation in a relational DB you basically have to work with one row (best case). This should be quite efficient as long as the storage is not a column-oriented (OLAP) format.

To the best of my knowledge there are two popular ways to store a graph:

Either actually storing "objects which point at each other". This is the natural representation of a graph.

Or storing a table that lists the relations. This is the less natural representation of a graph. (Both layouts are sketched in code below.)
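To make the two layouts concrete, here is a minimal Python sketch of the same tiny graph in both forms (the class and variable names are illustrative, not any real engine's format):

```python
# 1. "Objects which point at each other": each node holds direct
#    references (an adjacency list) to its neighbouring nodes.
class Node:
    def __init__(self, name):
        self.name = name
        self.out_edges = []  # direct pointers to neighbouring Node objects

a, b, c = Node("A"), Node("B"), Node("C")
a.out_edges.extend([b, c])

# 2. An edge table: relations listed as rows, much as a relational
#    schema would hold them.
nodes = ["A", "B", "C"]
edges = [("A", "B"), ("A", "C")]  # (source, target) rows

# Traversal cost differs: following a pointer is O(1) per hop, while
# the edge table needs a lookup (a join or index probe) per hop.
neighbours_via_pointers = [n.name for n in a.out_edges]
neighbours_via_table = [dst for src, dst in edges if src == "A"]
assert neighbours_via_pointers == neighbours_via_table == ["B", "C"]
```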

In the second case I understand why CRUD operations are slow: you might have to touch a lot of tables to represent what would just be an update of a single row in a relational schema.

Why would it be slow in the first case? For an update or delete you might only have to create/read/update/delete a single object.

Best Answer

I think the software involved is too complicated to give simple yes/no answers to this question. It also depends on how you define CRUD.

Let's take the "place order" function as an example. Seen from a business point of view this is a single atomic action. In a normalized relational database it will likely touch many tables, however. There will be an insert for the order; perhaps updates to customer or shopping-cart tables; reads of many related tables to ensure referential correctness. All these statements will (should?) be wrapped in a database transaction. So a single business action becomes many database statements wrapped in a single database transaction.
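A hedged sketch of that pattern, using Python's sqlite3 module (the schema and column names are invented for illustration; a real application's schema would differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id));
    CREATE TABLE order_lines (order_id INTEGER REFERENCES orders(id),
                              product TEXT, qty INTEGER);
""")
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Alice')")

# One business action, many statements, one transaction.
with conn:  # commits on success, rolls back on exception
    # Read: referential check (a FK constraint would do this implicitly).
    assert conn.execute(
        "SELECT 1 FROM customers WHERE id = ?", (1,)).fetchone()
    # Insert: the order itself.
    cur = conn.execute("INSERT INTO orders (customer_id) VALUES (?)", (1,))
    order_id = cur.lastrowid
    # Inserts: one row per order line.
    conn.execute(
        "INSERT INTO order_lines (order_id, product, qty) VALUES (?, ?, ?)",
        (order_id, "widget", 3))
```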

Were a document database used instead of a relational one, there would likely be a single write. A graph DBMS would sit between these two, being more or less normalized depending on the particular application's architecture.
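For contrast, the document-store version of the same action could be a single write of one nested document. Sketched here as a plain JSON dump (the document shape is illustrative, not any particular product's API):

```python
import json

order_doc = {
    "order_id": 42,
    "customer": {"id": 1, "name": "Alice"},      # denormalized copy
    "lines": [{"product": "widget", "qty": 3}],  # nested, not joined
}
# One logical write persists the whole business action.
with open("order_42.json", "w") as f:
    json.dump(order_doc, f)
```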

Looking just at on-disk structures, the discussion is more nuanced, too. Inserting a single row into a table seems simple. If there are indexes there will likely be write amplification; if an index page splits, more so. If the row holds BLOBs these may be stored in separate data structures, or in different files. There could be cascading actions through triggers or materialized views. Compression and encryption may incur significant CPU consumption.
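A toy model of that write amplification (not a real storage engine, just counting the structures one insert touches):

```python
# One row insert into a table with two secondary indexes touches
# three structures: the base table plus each index.
heap = []         # base table rows
idx_by_name = {}  # secondary index 1
idx_by_city = {}  # secondary index 2

def insert(row):
    heap.append(row)                                     # write 1: the row
    idx_by_name.setdefault(row["name"], []).append(row)  # write 2: index
    idx_by_city.setdefault(row["city"], []).append(row)  # write 3: index

insert({"name": "Alice", "city": "Oslo"})
# Three logical writes for one "simple" insert; page splits, BLOBs,
# triggers and materialized views would add more on a real engine.
```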

There are many on-disk structures that support graph-like processing. In one, nodes are the basic object and edges become properties of nodes: edges are stored as pointer lists in the nodes at either end of the edge. So creating a node is one write; creating an edge is two. Inserting a row with a foreign key constraint (the nearest relational equivalent to an edge) is a read and a write.
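A sketch of that pointer-list layout, counting logical record writes (illustrative in-memory structures, not any engine's actual on-disk format):

```python
class GraphStore:
    def __init__(self):
        self.nodes = {}
        self.writes = 0  # count of logical record writes

    def create_node(self, key):
        self.nodes[key] = {"out": [], "in": []}
        self.writes += 1  # one write: the new node record

    def create_edge(self, src, dst):
        self.nodes[src]["out"].append(dst)  # write 1: source node record
        self.nodes[dst]["in"].append(src)   # write 2: target node record
        self.writes += 2

g = GraphStore()
g.create_node("A"); g.create_node("B")
g.create_edge("A", "B")
assert g.writes == 4  # two node writes + two writes for the one edge
```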

On the read side there are swings and roundabouts, too. Counting child rows by parent is a two-table join in the relational world; if edge pointers are stored as a list within the node, it is a single node-list scan. Is-that-connected-to-this queries (friends of people who know someone who lives in the same city that person can reach by rail within one hour) are easy in graph DBs and monstrous in the relational world. For the "R" bit of CRUD, on transitive queries, graph wins hands-down every time.
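To illustrate why the traversal side favours the graph layout, here is a small sketch: counting children is a len() on the node's edge list, and a bounded "is X connected to Y" question is a plain breadth-first traversal. The same query in SQL needs a recursive CTE or a chain of self-joins (the graph and names here are invented for illustration):

```python
from collections import deque

adj = {"A": ["B", "C"], "B": ["D"], "C": [], "D": []}

def child_count(node):
    return len(adj[node])  # single node-list scan, no join

def connected_within(src, dst, max_hops):
    # Breadth-first search, bounded at max_hops edges.
    frontier, seen = deque([(src, 0)]), {src}
    while frontier:
        node, depth = frontier.popleft()
        if node == dst:
            return True
        if depth < max_hops:
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, depth + 1))
    return False

assert child_count("A") == 2
assert connected_within("A", "D", 2)
```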

In conclusion, I question the premise of the title. If milliseconds are precious in your context then test, at scale, on production-equivalent hardware.