NoSQL – Sparse Matrix Key-Value Database

nosql

I have a huge sparse matrix ( 10^9 rows, 10^6 cols, density < 0.03% ), where each row has at least one nonzero column, but some column may contain only zeros. The cells are decimal numbers > 0.

I am looking for some db (preferably key-value) that can retrieve whole row or whole column as fast as possible. Also, I don't need to do any analytics in the DB.

I've come across SciDB which should be fast with multidimensional data, but I am afraid it is too complex for my needs.

Also, another option is to use SQL db ( probably Postgres ) but that is a bit slow and can't be scaled as easily as most NoSQL ( I expect fast row increase in matrix ).

So my biggest hope is some key-value storage, but I am not sure how to represent the matrix.

  • Using CRS format – but I have no idea how to implement it using key-value store.
  • Maybe something like { key, key, value } and have indexed both keys, but I am not sure if this is even possible. – So far I've found some clues about secondary index, but I can't imagine how would it look like.

I have some experience with SQL databases but almost none with NoSQL 🙁

Best Answer

Interesting use case...

What makes your problem complex for a DB is your access pattern. Looks like you want to access both by row as well as column. General purpose DBs are generally either row-oriented storage (mostly) or column-oriented storage which will be their most efficient mode of access. They will support the access other way round also (for e.g column based access in a row-oriented storage) but it will not be most efficient for obvious reasons.

If you have a access pattern (row/column) which is way more frequent than the other, you can pick the appropriate DB. If both the access patterns are equally likely, you may consider storing the information in a redundant way. i.e Both the matrix and the transpose of it. As you said the density of the matrix is 0.03%, the overhead may not be too much. You can make the call here.

Coming to the DBs, most of the noSQL DBs offer a flex schema. i.e you do not need to define the schema(columns) upfront and the columns can be optional. For this reason, I think a noSQL DB will be a better fit for this sparse matrix use case. When you query for the row, you will get only the columns which has values in it. You will get the column name along with the result.

The CRS format, while it is great on its own for space saving, it does not fit so well in a DB schema. You will have to handle the access from you application logic. In other words, you will not be really using the row-based access mechanism of DB.

Another option is to use a modified CRS format. For every row, you can store the matrix column values as a series of (column,column value) pairs. You can store this as a single value in a single column of the database. This will avoid the per-column overhead of the DBs. However, you have to do extra processing to decode the matrix columns in your application.

Which DB ? I dont want to pick a name. I would be starting an opinion war. Please do this research separately.