Sql-server – Database Design for a Graph (Network Topology)

database-designsql server

Is there a recommended pattern for modeling a graph in a relational database?

My presumption is that you'd have the following three tables:

Graphs
------
GraphID, PK
GraphName

Each record in the Graphs table represents a single graph you are modeling.

Nodes
-----
NodeID, PK
GraphID, FK
NodeName, string
...

Each record in the Nodes table represents a node in GraphID, specifying attributes specific to that node.

Edges
-----
EdgeID, PK
GraphID, FK
FromNodeID, FK
ToNodeID, FK
EdgeName, string
...

Each record in the Edges table represents an edge in GraphID, specifying the Nodes it connects and the attributes of an edge (name, distance, etc.).

What I'm having a hard time wrapping my head around is what if there are different types of nodes, where each type of a node has some attributes in common (say 5 or so), but has many unique attributes (say 10-30)?

For simplicity's sake, imagine there are two types of nodes to capture: A and B. How do I model this? I've seen one option being to have a "general" Nodes table:

Nodes
-----
NodeID, PK
GraphID, FK
NodeName, string
AnotherSharedAttribute, string
...

And then a detailed table for node type A and node type B:

NodesA
------
NodeID, PK
UniqueToA1, string
UniqueToA2, string
...


NodesB
------
NodeID, PK
UniqueToB1, string
UniqueToB2, string
...

Am I understanding this approach correctly? Because what flummoxes me is how do I query this efficiently? It seems to me that to determine the attributes for a particular node, I have to check all possible node type tables (NodesA and NodesB in this case) for a matching NodeID. This might not be a big deal with just two different types of nodes, but in my project there are ~30 different types of nodes and 5 different types of edges.

Thanks

Best Answer

There is no great way for modeling a graph in a relational database. They are two different data models. One says data "is really" sets of tuples. The other says it "is really" nodes connected by edges. These are different. Although any real-world scenario can be represented in either model the affordances of each are different. If the use-case is to perform an operation on multiple items that conform to a predicate (e.g. update .. where..) the relational model will, generally speaking, be more convenient. For use-cases involving following connections between items ("people who bought .. also bought ..", "you and Andi have 17 friends in common") the graph model is likely to work better. A relational DBMS makes implementing the former easy and the latter more difficult, and vice versa for a graph DBMS.

There are a few well-trodden relational implementations for the specific sub-set of graphs that are trees.

When implementing a solution in a RDBMS we only have tables to work with. So one has to decide which tables are needed to implement a generalized graph. Are the items of interest in the problem domain simply "things" which can be modeled as a single Node table and all the values of interest in a NodeAttribute table? This is analogous to an EAV, but in a graphy kind of way. Generally this is not useful unless we are concerned purely with the connectivity of nodes.

Usually we want to group things into classes and apply different constraints to each class. We could add a Class attribute to the Node table. This still leaves all the hard work of filtering rows of specific Class, and navigating between classes, to user code.

Relational databases are designed to have one table per class. If we have multiple classes of nodes then each should have its own table. If, as in your question, there are sub-classes of nodes then the problem can be restated as "how to model sub-classes in a relational way?" This is now a problem of table inheritance, about which quite a lot has been written. Some DBMS implement this natively; SQL Server does not.

how do I query this efficiently?

The same way you query any other relational database. Good indexes that support the queries. Understanding the implementation and referencing only those objects required to satisfy the query's needs. If you have 30 types of nodes and 5 types of edges it is unlikely that all will be required in every query.

The discussion above for node tables applies equally for edge tables.

There are hybrid DBMS which allow the representation of data in both graph and relational ways. Starting with 2017 SQL Server introduced graph extensions. Sadly the MATCH clause was little more than syntactic sugar on SQL-89 joins redux. No graph-like affordances were "supported in the first release", though the language was encouraging.

With SQL Server 2019 transitive closure is now supported. One can only hope polymorphism is not far behind.

Best Answer

Related Solutions

Mysql – Product attribute schema design, mixing solutions

Which allows for faster queries in a graph database: a node with multiple properties, or a node that points to multiple nodes

Related Question