Modeling Graph Data in Cassandra DB

cassandra graph

I want to use Apache Cassandra to store a large amount of graph data according to a property graph model. The model contains the following entities:

Vertices: Each vertex contains a map of key/value pairs (properties). Some keys should be indexed for querying (see below).
Edges: Each edge connects two vertices in a given direction. It carries a label and possibly some edge data. The edge data is a map of key/value pairs, where some keys should also be indexed for querying.

Both vertices and edges have a unique primary key, which can be a string or integer value.

Example:

#A vertex
{node_type:'module',pk: 1,...}
#Another vertex
{node_type:'function',pk: 2,...}

#An edge
{incoming_vertex: 1,outgoing_vertex: 2,label: 'body',data : {}}
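
To make the example concrete, here is a minimal in-memory sketch of the two entity types in Python. It says nothing about the Cassandra storage layout. The class names, the `PrimaryKey` alias and the edge key `"e1"` are assumptions made for illustration, and the sketch keeps `pk` as a separate field rather than inside the property map.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Union

PrimaryKey = Union[int, str]  # keys may be strings or integers


@dataclass
class Vertex:
    pk: PrimaryKey
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    pk: PrimaryKey
    incoming_vertex: PrimaryKey
    outgoing_vertex: PrimaryKey
    label: str
    data: Dict[str, Any] = field(default_factory=dict)


module = Vertex(pk=1, properties={"node_type": "module"})
function = Vertex(pk=2, properties={"node_type": "function"})
# "e1" is a made-up edge key; the example above does not show one.
body = Edge(pk="e1", incoming_vertex=1, outgoing_vertex=2, label="body")
```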

I want to perform the following types of queries on the graph:

Retrieve a list of vertices based on their primary key (e.g. "fetch the vertex with pk = a5f…") or the value of one or several indexed properties (e.g. "fetch all vertices with node_type = 'module' and …").
Traverse the graph from a given vertex along its edges, using the edge label, direction and one or several indexed edge properties to determine the path taken (e.g. "fetch all vertices that are connected to vertex A through an outgoing edge with label body and property … = …").
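
The two query types can be sketched in plain Python as follows. This is only a toy illustration of the intended semantics, not a Cassandra implementation. The field names `src` and `dst`, the `INDEXED_KEYS` set, the `position` edge property and both helper functions are hypothetical.

```python
from collections import defaultdict

# Toy data: vertex pk -> properties, and edges as plain dicts.
vertices = {
    1: {"node_type": "module"},
    2: {"node_type": "function"},
    3: {"node_type": "function"},
}
edges = [
    {"src": 1, "dst": 2, "label": "body", "data": {"position": 0}},
    {"src": 1, "dst": 3, "label": "body", "data": {"position": 1}},
    {"src": 2, "dst": 3, "label": "calls", "data": {}},
]

# Property index: (key, value) -> set of vertex pks.
INDEXED_KEYS = {"node_type"}
property_index = defaultdict(set)
for pk, props in vertices.items():
    for key in props.keys() & INDEXED_KEYS:
        property_index[(key, props[key])].add(pk)

# Adjacency lists, one per (vertex, direction).
adjacency = defaultdict(list)
for edge in edges:
    adjacency[(edge["src"], "out")].append(edge)
    adjacency[(edge["dst"], "in")].append(edge)


def find_vertices(**conditions):
    """Return the pks of vertices matching all indexed property conditions."""
    sets = [property_index.get(item, set()) for item in conditions.items()]
    return set.intersection(*sets) if sets else set(vertices)


def neighbours(pk, direction, label, **edge_props):
    """Follow edges from pk in one direction, filtered by label and edge data."""
    other_end = "dst" if direction == "out" else "src"
    return [
        edge[other_end]
        for edge in adjacency.get((pk, direction), [])
        if edge["label"] == label
        and all(edge["data"].get(k) == v for k, v in edge_props.items())
    ]
```

For example, `find_vertices(node_type="function")` returns the pks 2 and 3, and `neighbours(1, "out", "body", position=1)` returns only vertex 3.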

In addition, I have the following requirements and boundary conditions:

Retrieving the list of edges for a given vertex should be as efficient as possible (O(1) ideally).
The number of edges will be much larger than the number of vertices in the graph.
The model should scale to several billion vertices and several hundred billion edges (appropriate hardware provided).
The graph data will usually be written only once and read many times, so the model can be optimized for query performance at the cost of write performance.

My initial idea for a data model is the following:

Use one column family each for vertices and for edges, where the row key is the primary key of the vertex/edge and a single text column contains its JSON data. Indexes on vertex/edge properties are modeled as additional columns (whose data is denormalized and manually updated whenever the vertex/edge data changes).
Use one dynamic column family for managing the adjacency (edge) list for vertices, with a composite primary key that contains the primary key of the vertex, the primary key of the edge, the edge label and the edge direction (incoming or outgoing) for each vertex.
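
A rough way to see what the composite key buys you is to simulate one sorted wide row per vertex in Python. This is a sketch under assumptions, not Cassandra code: the sorted list stands in for columns ordered by their composite key, and the key order (vertex, direction, label, edge) differs from the order listed above. That reordering is an assumption made here so that "all outgoing body edges of vertex 1" is a contiguous range. The edge keys `e1` to `e3` and `slice_by_prefix` are made up.

```python
import bisect

# Each entry mimics one column of a wide row:
# (vertex_pk, direction, label, edge_pk) -> pk of the vertex at the other end.
adjacency = sorted([
    ((1, "out", "body", "e1"), 2),
    ((1, "out", "body", "e2"), 3),
    ((1, "out", "calls", "e3"), 4),
    ((2, "in", "body", "e1"), 1),
    ((3, "in", "body", "e2"), 1),
    ((4, "in", "calls", "e3"), 1),
])
keys = [key for key, _ in adjacency]


def slice_by_prefix(prefix):
    """Return all entries whose composite key starts with prefix."""
    # A shorter tuple sorts before any longer tuple it is a prefix of,
    # so bisect_left lands on the first matching entry (if there is one).
    lo = bisect.bisect_left(keys, prefix)
    hi = lo
    while hi < len(keys) and keys[hi][:len(prefix)] == prefix:
        hi += 1
    return adjacency[lo:hi]


outgoing_body = slice_by_prefix((1, "out", "body"))
all_edges_of_1 = slice_by_prefix((1,))
```

With the edge key last, a lookup by vertex, by vertex and direction, or by vertex, direction and label is one seek plus a sequential read. With the edge key second, filtering by label or direction would mean scanning every edge of the vertex.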

Is this a sensible data model? Any other suggestions on how to implement this?

Best Answer

For a graph database on top of Cassandra, have a look at TitanDB.

What you need is already implemented in TitanDB. Implementing your own Graph Database is not trivial, and would be very time consuming. In most cases, a proven solution is best. (BTW, I am not involved in TitanDB development or business.) I have no idea about your use case, but I do not see a reason to implement something new, except as a hobby.

Update: I found a whitepaper about the Titan graph database's data model: https://github.com/thinkaurelius/titan/wiki/Titan-Data-Model. It gives some hints on how to design a datastore for graphs.

Aurelius is now also part of DataStax, and they are working on a combined solution for storing big graphs in Cassandra.

Stage 1 : Sample Data

DROP DATABASE IF EXISTS bootvis;
CREATE DATABASE bootvis;
USE bootvis
CREATE TABLE graph_edges
(
    pk int not null auto_increment,
    object_id int default null,
    tm timestamp not null default current_timestamp,
    from_id int,
    to_id int,
    primary key (pk)
);
INSERT INTO graph_edges (from_id,to_id) VALUES (1,2),(1,2),(1,2),(1,2);
SELECT SLEEP(2);
INSERT INTO graph_edges (from_id,to_id) VALUES (2,1),(2,1);
SELECT SLEEP(1);
INSERT INTO graph_edges (from_id,to_id) VALUES (1,2),(1,2);
INSERT INTO graph_edges (from_id,to_id) VALUES (2,1),(2,1),(2,1),(2,1),(2,1),(2,1),(2,1),(2,1);
INSERT INTO graph_edges (from_id,to_id) VALUES (1,2),(1,2),(1,2),(1,2);
INSERT INTO graph_edges (from_id,to_id) VALUES (2,1),(2,1);
UPDATE graph_edges SET object_id = 1 WHERE object_id IS NULL;
INSERT INTO graph_edges (from_id,to_id) VALUES (1,2),(1,2);
INSERT INTO graph_edges (from_id,to_id) VALUES (2,1),(2,1),(2,1),(2,1),(2,1),(2,1),(2,1),(2,1);
SELECT SLEEP(2);
INSERT INTO graph_edges (from_id,to_id) VALUES (1,2),(1,2),(1,2),(1,2);
INSERT INTO graph_edges (from_id,to_id) VALUES (2,1),(2,1);
INSERT INTO graph_edges (from_id,to_id) VALUES (1,2),(1,2),(1,2),(1,2);
INSERT INTO graph_edges (from_id,to_id) VALUES (2,1),(2,1);
INSERT INTO graph_edges (from_id,to_id) VALUES (1,2),(1,2),(1,2),(1,2);
INSERT INTO graph_edges (from_id,to_id) VALUES (2,1),(2,1);
UPDATE graph_edges SET object_id = 2 WHERE object_id IS NULL;
SELECT * FROM graph_edges;

Stage 2 : Sample Data Loaded

mysql> DROP DATABASE IF EXISTS bootvis;
Query OK, 1 row affected (0.03 sec)

mysql> CREATE DATABASE bootvis;
Query OK, 1 row affected (0.00 sec)

mysql> USE bootvis
Database changed
mysql> CREATE TABLE graph_edges
    -> (
    ->     pk int not null auto_increment,
    ->     object_id int default null,
    ->     tm timestamp not null default current_timestamp,
    ->     from_id int,
    ->     to_id int,
    ->     primary key (pk)
    -> );
Query OK, 0 rows affected (0.12 sec)

mysql> INSERT INTO graph_edges (from_id,to_id) VALUES (1,2),(1,2),(1,2),(1,2);
Query OK, 4 rows affected (0.05 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> SELECT SLEEP(2);
+----------+
| SLEEP(2) |
+----------+
|        0 |
+----------+
1 row in set (2.00 sec)

mysql> INSERT INTO graph_edges (from_id,to_id) VALUES (2,1),(2,1);
Query OK, 2 rows affected (0.05 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> SELECT SLEEP(1);
+----------+
| SLEEP(1) |
+----------+
|        0 |
+----------+
1 row in set (1.00 sec)

mysql> INSERT INTO graph_edges (from_id,to_id) VALUES (1,2),(1,2);
Query OK, 2 rows affected (0.07 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> INSERT INTO graph_edges (from_id,to_id) VALUES (2,1),(2,1),(2,1),(2,1),(2,1),(2,1),(2,1),(2,1);
Query OK, 8 rows affected (0.08 sec)
Records: 8  Duplicates: 0  Warnings: 0

mysql> INSERT INTO graph_edges (from_id,to_id) VALUES (1,2),(1,2),(1,2),(1,2);
Query OK, 4 rows affected (0.13 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> INSERT INTO graph_edges (from_id,to_id) VALUES (2,1),(2,1);
Query OK, 2 rows affected (0.10 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> UPDATE graph_edges SET object_id = 1 WHERE object_id IS NULL;
Query OK, 22 rows affected (0.06 sec)
Rows matched: 22  Changed: 22  Warnings: 0

mysql> INSERT INTO graph_edges (from_id,to_id) VALUES (1,2),(1,2);
Query OK, 2 rows affected (0.05 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> INSERT INTO graph_edges (from_id,to_id) VALUES (2,1),(2,1),(2,1),(2,1),(2,1),(2,1),(2,1),(2,1);
Query OK, 8 rows affected (0.05 sec)
Records: 8  Duplicates: 0  Warnings: 0

mysql> SELECT SLEEP(2);
+----------+
| SLEEP(2) |
+----------+
|        0 |
+----------+
1 row in set (2.00 sec)

mysql> INSERT INTO graph_edges (from_id,to_id) VALUES (1,2),(1,2),(1,2),(1,2);
Query OK, 4 rows affected (0.08 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> INSERT INTO graph_edges (from_id,to_id) VALUES (2,1),(2,1);
Query OK, 2 rows affected (0.05 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> INSERT INTO graph_edges (from_id,to_id) VALUES (1,2),(1,2),(1,2),(1,2);
Query OK, 4 rows affected (0.06 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> INSERT INTO graph_edges (from_id,to_id) VALUES (2,1),(2,1);
Query OK, 2 rows affected (0.05 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> INSERT INTO graph_edges (from_id,to_id) VALUES (1,2),(1,2),(1,2),(1,2);
Query OK, 4 rows affected (0.05 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> INSERT INTO graph_edges (from_id,to_id) VALUES (2,1),(2,1);
Query OK, 2 rows affected (0.06 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> UPDATE graph_edges SET object_id = 2 WHERE object_id IS NULL;
Query OK, 28 rows affected (0.06 sec)
Rows matched: 28  Changed: 28  Warnings: 0

mysql> SELECT * FROM graph_edges;
+----+-----------+---------------------+---------+-------+
| pk | object_id | tm                  | from_id | to_id |
+----+-----------+---------------------+---------+-------+
|  1 |         1 | 2012-11-20 17:29:13 |       1 |     2 |
|  2 |         1 | 2012-11-20 17:29:13 |       1 |     2 |
|  3 |         1 | 2012-11-20 17:29:13 |       1 |     2 |
|  4 |         1 | 2012-11-20 17:29:13 |       1 |     2 |
|  5 |         1 | 2012-11-20 17:29:15 |       2 |     1 |
|  6 |         1 | 2012-11-20 17:29:15 |       2 |     1 |
|  7 |         1 | 2012-11-20 17:29:16 |       1 |     2 |
|  8 |         1 | 2012-11-20 17:29:16 |       1 |     2 |
|  9 |         1 | 2012-11-20 17:29:16 |       2 |     1 |
| 10 |         1 | 2012-11-20 17:29:16 |       2 |     1 |
| 11 |         1 | 2012-11-20 17:29:16 |       2 |     1 |
| 12 |         1 | 2012-11-20 17:29:16 |       2 |     1 |
| 13 |         1 | 2012-11-20 17:29:16 |       2 |     1 |
| 14 |         1 | 2012-11-20 17:29:16 |       2 |     1 |
| 15 |         1 | 2012-11-20 17:29:16 |       2 |     1 |
| 16 |         1 | 2012-11-20 17:29:16 |       2 |     1 |
| 17 |         1 | 2012-11-20 17:29:16 |       1 |     2 |
| 18 |         1 | 2012-11-20 17:29:16 |       1 |     2 |
| 19 |         1 | 2012-11-20 17:29:16 |       1 |     2 |
| 20 |         1 | 2012-11-20 17:29:16 |       1 |     2 |
| 21 |         1 | 2012-11-20 17:29:17 |       2 |     1 |
| 22 |         1 | 2012-11-20 17:29:17 |       2 |     1 |
| 23 |         2 | 2012-11-20 17:29:17 |       1 |     2 |
| 24 |         2 | 2012-11-20 17:29:17 |       1 |     2 |
| 25 |         2 | 2012-11-20 17:29:17 |       2 |     1 |
| 26 |         2 | 2012-11-20 17:29:17 |       2 |     1 |
| 27 |         2 | 2012-11-20 17:29:17 |       2 |     1 |
| 28 |         2 | 2012-11-20 17:29:17 |       2 |     1 |
| 29 |         2 | 2012-11-20 17:29:17 |       2 |     1 |
| 30 |         2 | 2012-11-20 17:29:17 |       2 |     1 |
| 31 |         2 | 2012-11-20 17:29:17 |       2 |     1 |
| 32 |         2 | 2012-11-20 17:29:17 |       2 |     1 |
| 33 |         2 | 2012-11-20 17:29:19 |       1 |     2 |
| 34 |         2 | 2012-11-20 17:29:19 |       1 |     2 |
| 35 |         2 | 2012-11-20 17:29:19 |       1 |     2 |
| 36 |         2 | 2012-11-20 17:29:19 |       1 |     2 |
| 37 |         2 | 2012-11-20 17:29:19 |       2 |     1 |
| 38 |         2 | 2012-11-20 17:29:19 |       2 |     1 |
| 39 |         2 | 2012-11-20 17:29:19 |       1 |     2 |
| 40 |         2 | 2012-11-20 17:29:19 |       1 |     2 |
| 41 |         2 | 2012-11-20 17:29:19 |       1 |     2 |
| 42 |         2 | 2012-11-20 17:29:19 |       1 |     2 |
| 43 |         2 | 2012-11-20 17:29:19 |       2 |     1 |
| 44 |         2 | 2012-11-20 17:29:19 |       2 |     1 |
| 45 |         2 | 2012-11-20 17:29:19 |       1 |     2 |
| 46 |         2 | 2012-11-20 17:29:19 |       1 |     2 |
| 47 |         2 | 2012-11-20 17:29:19 |       1 |     2 |
| 48 |         2 | 2012-11-20 17:29:19 |       1 |     2 |
| 49 |         2 | 2012-11-20 17:29:19 |       2 |     1 |
| 50 |         2 | 2012-11-20 17:29:19 |       2 |     1 |
+----+-----------+---------------------+---------+-------+
50 rows in set (0.00 sec)

mysql>

Stage 3 : Create Query with a Running Counter That Changes When the Tuple Changes (call it the Tuple Change Query)

SET @r=0;
SET @old_from=-1;
SET @old_to=-1;
SELECT from_id,to_id,
@inc:=((@old_from<>from_id)||(@old_to<>to_id)),
@old_from:=from_id,@old_to:=to_id,
@r:=(@r+@inc) as group_number
FROM graph_edges;

Stage 4 : Run the Tuple Change Query

mysql> SET @r=0;
Query OK, 0 rows affected (0.00 sec)

mysql> SET @old_from=-1;
Query OK, 0 rows affected (0.00 sec)

mysql> SET @old_to=-1;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT from_id,to_id,
    -> @inc:=((@old_from<>from_id)||(@old_to<>to_id)),
    -> @old_from:=from_id,@old_to:=to_id,
    -> @r:=(@r+@inc) as group_number
    -> FROM graph_edges;
+---------+-------+------------------------------------------------+--------------------+----------------+--------------+
| from_id | to_id | @inc:=((@old_from<>from_id)||(@old_to<>to_id)) | @old_from:=from_id | @old_to:=to_id | group_number |
+---------+-------+------------------------------------------------+--------------------+----------------+--------------+
|       1 |     2 |                                              1 |                  1 |              2 |            1 |
|       1 |     2 |                                              0 |                  1 |              2 |            1 |
|       1 |     2 |                                              0 |                  1 |              2 |            1 |
|       1 |     2 |                                              0 |                  1 |              2 |            1 |
|       2 |     1 |                                              1 |                  2 |              1 |            2 |
|       2 |     1 |                                              0 |                  2 |              1 |            2 |
|       1 |     2 |                                              1 |                  1 |              2 |            3 |
|       1 |     2 |                                              0 |                  1 |              2 |            3 |
|       2 |     1 |                                              1 |                  2 |              1 |            4 |
|       2 |     1 |                                              0 |                  2 |              1 |            4 |
|       2 |     1 |                                              0 |                  2 |              1 |            4 |
|       2 |     1 |                                              0 |                  2 |              1 |            4 |
|       2 |     1 |                                              0 |                  2 |              1 |            4 |
|       2 |     1 |                                              0 |                  2 |              1 |            4 |
|       2 |     1 |                                              0 |                  2 |              1 |            4 |
|       2 |     1 |                                              0 |                  2 |              1 |            4 |
|       1 |     2 |                                              1 |                  1 |              2 |            5 |
|       1 |     2 |                                              0 |                  1 |              2 |            5 |
|       1 |     2 |                                              0 |                  1 |              2 |            5 |
|       1 |     2 |                                              0 |                  1 |              2 |            5 |
|       2 |     1 |                                              1 |                  2 |              1 |            6 |
|       2 |     1 |                                              0 |                  2 |              1 |            6 |
|       1 |     2 |                                              1 |                  1 |              2 |            7 |
|       1 |     2 |                                              0 |                  1 |              2 |            7 |
|       2 |     1 |                                              1 |                  2 |              1 |            8 |
|       2 |     1 |                                              0 |                  2 |              1 |            8 |
|       2 |     1 |                                              0 |                  2 |              1 |            8 |
|       2 |     1 |                                              0 |                  2 |              1 |            8 |
|       2 |     1 |                                              0 |                  2 |              1 |            8 |
|       2 |     1 |                                              0 |                  2 |              1 |            8 |
|       2 |     1 |                                              0 |                  2 |              1 |            8 |
|       2 |     1 |                                              0 |                  2 |              1 |            8 |
|       1 |     2 |                                              1 |                  1 |              2 |            9 |
|       1 |     2 |                                              0 |                  1 |              2 |            9 |
|       1 |     2 |                                              0 |                  1 |              2 |            9 |
|       1 |     2 |                                              0 |                  1 |              2 |            9 |
|       2 |     1 |                                              1 |                  2 |              1 |           10 |
|       2 |     1 |                                              0 |                  2 |              1 |           10 |
|       1 |     2 |                                              1 |                  1 |              2 |           11 |
|       1 |     2 |                                              0 |                  1 |              2 |           11 |
|       1 |     2 |                                              0 |                  1 |              2 |           11 |
|       1 |     2 |                                              0 |                  1 |              2 |           11 |
|       2 |     1 |                                              1 |                  2 |              1 |           12 |
|       2 |     1 |                                              0 |                  2 |              1 |           12 |
|       1 |     2 |                                              1 |                  1 |              2 |           13 |
|       1 |     2 |                                              0 |                  1 |              2 |           13 |
|       1 |     2 |                                              0 |                  1 |              2 |           13 |
|       1 |     2 |                                              0 |                  1 |              2 |           13 |
|       2 |     1 |                                              1 |                  2 |              1 |           14 |
|       2 |     1 |                                              0 |                  2 |              1 |           14 |
+---------+-------+------------------------------------------------+--------------------+----------------+--------------+
50 rows in set (0.00 sec)

mysql>

Please notice that @inc is 1 only on rows where the (from_id,to_id) tuple differs from the previous row, so group_number increases only when the tuple changes !!!
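
For readers who find the user-variable trick hard to follow, the same running-counter logic can be sketched in Python. This is an illustration of the idea only, not a replacement for the query. `number_runs` is a made-up helper, and the sample list reproduces just the first 16 rows of the table.

```python
def number_runs(rows):
    """Yield (row, group_number); the number goes up whenever a row
    differs from the row before it."""
    group_number = 0
    previous = None  # plays the role of @old_from / @old_to
    for row in rows:
        inc = int(row != previous)  # plays the role of @inc
        group_number += inc         # plays the role of @r
        previous = row
        yield row, group_number


# First 16 (from_id, to_id) rows of graph_edges, in pk order.
edges = [(1, 2)] * 4 + [(2, 1)] * 2 + [(1, 2)] * 2 + [(2, 1)] * 8
numbered = list(number_runs(edges))
group_numbers = [g for _, g in numbered]
```

Like the SQL version, this depends on the rows arriving in insertion order.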

Stage 5 : Put Tuple Change Query in a Subquery, extract Needed Data From Tuple Change Query, Run GROUP BY group_number (call it Duplicate Extraction Query)

SET @r=0;
SET @old_from=-1;
SET @old_to=-1;
SELECT group_number,from_id,to_id FROM
(
    SELECT from_id,to_id,
    @inc:=((@old_from<>from_id)||(@old_to<>to_id)),
    @old_from:=from_id,@old_to:=to_id,
    @r:=(@r+@inc) as group_number
    FROM graph_edges
) g GROUP BY group_number;

Stage 6 : Run Duplicate Extraction Query

mysql> SET @r=0;
Query OK, 0 rows affected (0.00 sec)

mysql> SET @old_from=-1;
Query OK, 0 rows affected (0.00 sec)

mysql> SET @old_to=-1;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT group_number,from_id,to_id FROM
    -> (
    ->     SELECT from_id,to_id,
    ->     @inc:=((@old_from<>from_id)||(@old_to<>to_id)),
    ->     @old_from:=from_id,@old_to:=to_id,
    ->     @r:=(@r+@inc) as group_number
    ->     FROM graph_edges
    -> ) g GROUP BY group_number;
+--------------+---------+-------+
| group_number | from_id | to_id |
+--------------+---------+-------+
|            1 |       1 |     2 |
|            2 |       2 |     1 |
|            3 |       1 |     2 |
|            4 |       2 |     1 |
|            5 |       1 |     2 |
|            6 |       2 |     1 |
|            7 |       1 |     2 |
|            8 |       2 |     1 |
|            9 |       1 |     2 |
|           10 |       2 |     1 |
|           11 |       1 |     2 |
|           12 |       2 |     1 |
|           13 |       1 |     2 |
|           14 |       2 |     1 |
+--------------+---------+-------+
14 rows in set (0.00 sec)

mysql>

Stage 7 : (OPTIONAL) Show Count for Each group_number

SET @r=0;
SET @old_from=-1;
SET @old_to=-1;
SELECT group_number,from_id,to_id,count(1) group_count FROM
(
    SELECT from_id,to_id,
    @inc:=((@old_from<>from_id)||(@old_to<>to_id)),
    @old_from:=from_id,@old_to:=to_id,
    @r:=(@r+@inc) as group_number
    FROM graph_edges
) g GROUP BY group_number;

Stage 8 : (OPTIONAL) Run the Show Count Query for Each group_number

mysql> SET @r=0;
Query OK, 0 rows affected (0.00 sec)

mysql> SET @old_from=-1;
Query OK, 0 rows affected (0.00 sec)

mysql> SET @old_to=-1;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT group_number,from_id,to_id,count(1) group_count FROM
    -> (
    ->     SELECT from_id,to_id,
    ->     @inc:=((@old_from<>from_id)||(@old_to<>to_id)),
    ->     @old_from:=from_id,@old_to:=to_id,
    ->     @r:=(@r+@inc) as group_number
    ->     FROM graph_edges
    -> ) g GROUP BY group_number;
+--------------+---------+-------+-------------+
| group_number | from_id | to_id | group_count |
+--------------+---------+-------+-------------+
|            1 |       1 |     2 |           4 |
|            2 |       2 |     1 |           2 |
|            3 |       1 |     2 |           2 |
|            4 |       2 |     1 |           8 |
|            5 |       1 |     2 |           4 |
|            6 |       2 |     1 |           2 |
|            7 |       1 |     2 |           2 |
|            8 |       2 |     1 |           8 |
|            9 |       1 |     2 |           4 |
|           10 |       2 |     1 |           2 |
|           11 |       1 |     2 |           4 |
|           12 |       2 |     1 |           2 |
|           13 |       1 |     2 |           4 |
|           14 |       2 |     1 |           2 |
+--------------+---------+-------+-------------+
14 rows in set (0.00 sec)

mysql>
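
As a cross-check outside MySQL, the grouping and counting step can be sketched with `itertools.groupby`, which groups consecutive equal items. The `run_lengths` list is rebuilt by hand from the INSERT statements in Stage 1, and the variable names are made up for this sketch.

```python
from itertools import groupby

# Number of rows per INSERT in Stage 1, in order; the inserts alternate
# between (1,2) and (2,1), starting with (1,2).
run_lengths = [4, 2, 2, 8, 4, 2, 2, 8, 4, 2, 4, 2, 4, 2]
edges = []
for i, n in enumerate(run_lengths):
    edges += [(1, 2) if i % 2 == 0 else (2, 1)] * n

# groupby starts a new group each time the (from_id, to_id) tuple changes,
# which is what group_number tracks in the SQL version.
result = [
    (group_number, from_id, to_id, len(list(run)))
    for group_number, ((from_id, to_id), run) in enumerate(groupby(edges), start=1)
]
```

Each tuple in `result` corresponds to one row of the Stage 8 output: group_number, from_id, to_id, group_count.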

Stage 9 : Give it a Try !!!

NoSQL – What is Unstructured Data?

There are a couple of concepts which need to be distinguished. One is about structure and the other about schema.

Structured data is data where the application knows in advance the meaning of each byte it receives. A good example is measurements from a sensor. In contrast, a Twitter stream is unstructured. Schema is about how much of that structure is communicated to the DBMS and how strictly the DBMS is asked to enforce it. It controls how much the DBMS parses the data it stores. A schema-required DBMS such as SQL Server can store unparsed data (varbinary), optionally parsed data (xml), and fully parsed data (columns).

NoSQL DBMSs lie on a spectrum from no parsing (key-value stores) upwards. Cassandra offers relatively rich functionality in this respect. Where they differ markedly from relational stores is in the uniformity of the data. In a relational store, once a table is defined, only data which matches that definition may be held there. In Cassandra, however, even if columns and families are defined, there is no requirement for any two rows in the same table to look anything like each other. It falls to the application designer to decide how much goes in a single row (also referred to as a document) and what is held separately, linked by pointers. In effect: how much denormalisation do you want?

The advantage is that you can retrieve a full set of data with a single sequential read. This is fast. One downside is that you, the application programmer, are now solely responsible for all data integrity and backward compatibility concerns, forever, for every bit of code that ever touches this data store. That can be difficult to get right. Also, you are locked into one point of view on the data. If you key your rows by order number, how do you report on the sales of one particular product, or region, or customer?
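
The order-number question can be made concrete with a small Python sketch. A plain dict stands in for the row store here, and the order data, field names and helper function are all invented for illustration.

```python
from collections import defaultdict

# Each "row" is keyed by order number and holds the whole order, denormalised.
orders = {
    1001: {"customer": "alice", "region": "EU", "lines": [("widget", 2), ("gadget", 1)]},
    1002: {"customer": "bob", "region": "US", "lines": [("widget", 5)]},
    1003: {"customer": "alice", "region": "EU", "lines": [("gadget", 3)]},
}

# Fetching by key is a single lookup that returns everything about the order.
order = orders[1002]


def units_sold_by_scan(product):
    """Report on one product by reading every order: a full scan."""
    return sum(
        qty
        for o in orders.values()
        for name, qty in o["lines"]
        if name == product
    )


# The alternative: a second view keyed by product, which the application
# itself must keep in step with the orders on every write.
units_by_product = defaultdict(int)
for o in orders.values():
    for name, qty in o["lines"]:
        units_by_product[name] += qty
```

Either every report scans all the rows, or the application maintains an extra copy per point of view and owns the job of keeping the copies consistent.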