Modeling Graph Data in Cassandra DB

cassandragraph

I want to use Apache Cassandra to store a large amount of graph data according to a property graph model. The model contains the following entities:

  • Vertices: Contains a map of key/value pairs (properties). Some keys should be indexed for querying (see below).
  • Edges: Connects two vertices to each other in a given direction. Contains a label and possibly some edge data. The edge data is a map of key/value pairs, where some keys should also be indexed for querying.

Both vertices and edges have a unique primary key, which can be a string or integer value.

Example:

#A vertex
{node_type:'module',pk: 1,...}
#Another vertex
{node_type:'function',pk: 2,...}

#An edge
{incoming_vertex: 1,outgoing_vertex: 2,label: 'body',data : {}}

I want to perform the following types of queries on the graph:

  • Retrieve a list of vertices based on their primary key (e.g. "fetch the vertex with pk = a5f…") or the value of one or several indexed properties (e.g. "fetch all vertices with node_type = 'module' and …").
  • Traverse the graph from a given vertex along its edges, using the edge label, direction and one or several indexed edge properties to determine the path taken (e.g. "fetch all vertices that are connected to vertex A through an outgoing edge with label body and property … = …).

In addition, I have the following requirements and boundary conditions:

  • Retrieving the list of edges for a given vertex should be as efficient as possible (O(1) ideally).
  • The number of edges will be much larger than the number of vertices in the graph.
  • The model should scale to several billion vertices and several hundred billion edges (appropriate hardware provided).
  • The graph data will usually be written only once and read many times, so the model can be optimized for query performance at the cost of write performance.

My initial idea for a data model is the following:

  • Use one column-family for vertices as well as edges respectively, where the row-key is the primary key of the vertex/edge and a single text column contains its JSON data. Indexes on vertex/edge properties are modeled as additional columns (whose data is denormalized and manually updated whenever the vertex/edge data changes)
  • Use one dynamic column family for managing the adjacency (edge) list for vertices, with a composite primary key that contains the primary key of the vertex, the primary key of the edge, the edge label and the edge direction (incoming or outgoing) for each vertex.

Is this a sensible data model? Any other suggestions on how to implement this?

Best Answer

For Graph database on Cassandra have a look at TitanDB:

What you need is already implemented in TitanDB. Implementing your own Graph Database is not trivial, and would be very time consuming. In most cases, a proven solution is best. (BTW, I am not involved in TitanDB development or business.) I have no idea about your use case, but I do not see a reason to implement something new, except as a hobby.

Update I found a whitepaper about Titan GraphDB's data model in database: https://github.com/thinkaurelius/titan/wiki/Titan-Data-Model. It gives some hints how to design a datastore for graphs.

Aurelius is now also part of Datastax and they work on a combined solution for storing big graphs in Cassandra.