An efficient storage method to store Graph Data from GraphX (Spark)

cassandra, columnstore, database-design, graph, storage

I generate graph data using Spark's GraphX library and need an efficient way to store it. I have access to Apache Cassandra and would ideally store the graph there, but I don't know how to store the graph structure in Cassandra efficiently. I have looked at Titan's Cassandra storage backend, but I could not find a detailed explanation of how the data (edges and properties) is formatted and stored. Titan is only an example; any existing alternative or methodology is welcome, and the persistence layer doesn't have to be Cassandra. What I mostly need is an efficient way to store the graph that also lets me filter on the backend. For example, if I store the graph on HDFS, I have to load the whole dataset into Spark to analyze and transform it, which is infeasible for big data. So I need storage that lets me filter server-side before loading anything into memory.

Edit: My requirements are that the store is a distributed graph database that can handle big data, is scalable, and has a Spark connector of some sort (or another way to work with Spark).

Best Answer

TitanDB stores graphs in adjacency list format, which means that a graph is stored as a collection of vertices, each with its adjacency list. The adjacency list of a vertex contains all of the vertex’s incident edges (and properties).
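As a rough illustration of the adjacency list idea (not Titan's actual code), the sketch below groups everything a vertex needs under that vertex. The function name `build_adjacency`, the `"OUT"`/`"IN"` labels, and the sample data are all made up for this example. Because the list holds incident edges, each edge shows up twice here, once under each endpoint.

```python
def build_adjacency(vertices, edges):
    """Group properties and incident edges under each vertex.

    vertices: {vertex_id: {property_name: value}}
    edges:    [(edge_id, src_id, dst_id, {property_name: value})]
    """
    adjacency = {
        vid: {"properties": dict(props), "edges": []}
        for vid, props in vertices.items()
    }
    for edge_id, src, dst, props in edges:
        # An edge is incident to both endpoints, so it is recorded under each.
        adjacency[src]["edges"].append(("OUT", dst, edge_id, dict(props)))
        adjacency[dst]["edges"].append(("IN", src, edge_id, dict(props)))
    return adjacency


# Hypothetical sample graph: 1 -> 2 and 1 -> 3
vertices = {1: {"name": "alice"}, 2: {"name": "bob"}, 3: {"name": "carol"}}
edges = [(100, 1, 2, {"weight": 0.5}), (101, 1, 3, {"weight": 0.9})]
adjacency = build_adjacency(vertices, edges)
```

With this shape, everything about vertex 1 (its properties and both outgoing edges) lives in `adjacency[1]`, so reading one vertex never requires scanning the rest of the graph.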

Titan uses the VertexID as the partition key, the PropertyKeyID or EdgeID as the clustering key, and the property value or edge properties as a regular column.

Titan Data Layout

Their Cassandra table schema:

CREATE TABLE edgestore (
    key blob,
    column1 blob,
    value blob,
    PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE AND CLUSTERING ORDER BY (column1 ASC);

Where

key     => VertexID
column1 => PropertyKeyID or DIR+OtherID+EdgeID
value   => PropertyValue or Pair of EdgePropertyKeyID and EdgePropertyValue
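To make the mapping above concrete, here is a minimal in-memory sketch of the same (key, column1, value) layout. It is an assumption-laden toy: Titan's real blobs use a binary encoding that is not reproduced here, plain tuples stand in for them, and the names `EdgeStore`, `slice_columns`, the type tags, and the numeric property key IDs are all hypothetical. The point is that, because columns are sorted within a row, all out-edges of a vertex form one contiguous range, which is what makes the filtering you asked about possible on the storage side.

```python
# Hypothetical tags; in this sketch properties sort before edges.
PROPERTY, EDGE = 0, 1
OUT, IN = 0, 1


class EdgeStore:
    """Toy stand-in for the edgestore table: key -> {column1: value}."""

    def __init__(self):
        self.rows = {}

    def put(self, key, column1, value):
        self.rows.setdefault(key, {})[column1] = value

    def slice_columns(self, key, start, end):
        """Return cells of one row with start <= column1 < end, in column order."""
        cells = self.rows.get(key, {})
        return [(c, cells[c]) for c in sorted(cells) if start <= c < end]


store = EdgeStore()
# key = VertexID, column1 = PropertyKeyID, value = PropertyValue
store.put(1, (PROPERTY, 7), "alice")  # 7 is a made-up id for "name"
# key = VertexID, column1 = DIR + OtherID + EdgeID,
# value = pairs of (EdgePropertyKeyID, EdgePropertyValue)
store.put(1, (EDGE, OUT, 2, 100), ((8, 0.5),))  # 8 is a made-up id for "weight"
store.put(1, (EDGE, OUT, 3, 101), ((8, 0.9),))
store.put(2, (EDGE, IN, 1, 100), ((8, 0.5),))

# Only the out-edges of vertex 1: one row, one contiguous column range.
out_edges = store.slice_columns(1, (EDGE, OUT), (EDGE, OUT + 1))
neighbours = [column[2] for column, _ in out_edges]
```

In Cassandra itself, the analogous read would be a query on a single `key` with a range restriction on `column1`, so only the matching slice of the row is returned rather than the whole graph.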