NoSQL – Sparse Matrix Key-Value Database

nosql

I have a huge sparse matrix (10^9 rows, 10^6 cols, density < 0.03%), where each row has at least one nonzero column, but some columns may contain only zeros. The nonzero cells are decimal numbers > 0.

I am looking for a db (preferably key-value) that can retrieve a whole row or a whole column as fast as possible. I don't need to do any analytics in the DB.

I've come across SciDB which should be fast with multidimensional data, but I am afraid it is too complex for my needs.

Another option is to use an SQL db (probably Postgres), but that is a bit slow and can't be scaled as easily as most NoSQL stores (I expect the number of rows in the matrix to grow quickly).

So my biggest hope is some key-value storage, but I am not sure how to represent the matrix.

Using the CRS (compressed row storage) format, but I have no idea how to implement it on top of a key-value store.
Maybe something like { key, key, value } with both keys indexed, but I am not sure if this is even possible. So far I've found some hints about secondary indexes, but I can't imagine what that would look like.

I have some experience with SQL databases but almost none with NoSQL 🙁

Best Answer

Interesting use case...

What makes your problem hard for a DB is your access pattern: it looks like you want to access the data both by row and by column. General-purpose DBs use either row-oriented storage (most of them) or column-oriented storage, and that orientation is their most efficient mode of access. They also support access the other way round (e.g. column-based access in a row-oriented store), but it will be less efficient, because the cells of one column are scattered across the stored rows.

If one access pattern (row or column) is far more frequent than the other, you can pick the DB that suits it. If both access patterns are equally likely, you may consider storing the information redundantly, i.e. both the matrix and its transpose. Since the density of the matrix is below 0.03%, the overhead may not be too much. You can make the call here.
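To make the "matrix plus transpose" idea concrete, here is a minimal sketch. It uses a plain in-memory dict as a stand-in for a real key-value store; the class name and the "row:<i>" / "col:<j>" key format are just assumptions for illustration, not the API of any particular DB.

```python
from collections import defaultdict


class SparseMatrixStore:
    """Toy in-memory stand-in for a key-value store (hypothetical)."""

    def __init__(self):
        # "row:<i>" -> {col: value}  and  "col:<j>" -> {row: value}
        self.kv = defaultdict(dict)

    def put(self, row, col, value):
        # Every write goes to both copies: the matrix and its transpose.
        self.kv[f"row:{row}"][col] = value
        self.kv[f"col:{col}"][row] = value

    def get_row(self, row):
        # One key lookup returns all nonzero cells of the row.
        return dict(self.kv.get(f"row:{row}", {}))

    def get_col(self, col):
        # One key lookup returns all nonzero cells of the column.
        return dict(self.kv.get(f"col:{col}", {}))


store = SparseMatrixStore()
store.put(0, 5, 1.5)
store.put(0, 9, 2.25)
store.put(3, 5, 0.75)

row0 = store.get_row(0)  # nonzero cells of row 0, keyed by column
col5 = store.get_col(5)  # nonzero cells of column 5, keyed by row
```

The price is that every cell is written twice and the two copies have to be kept in sync by the application.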

Coming to the DBs, most NoSQL DBs offer a flexible schema, i.e. you do not need to define the schema (columns) upfront, and columns can be optional. For this reason, I think a NoSQL DB will be a better fit for this sparse matrix use case. When you query for a row, you get only the columns that have values in it, with the column name returned along with each value.

The CRS format, while great on its own for saving space, does not fit so well in a DB schema. You will have to handle the access from your application logic. In other words, you will not really be using the row-based access mechanism of the DB.

Another option is to use a modified CRS format. For every row, you can store the matrix column values as a series of (column,column value) pairs. You can store this as a single value in a single column of the database. This will avoid the per-column overhead of the DBs. However, you have to do extra processing to decode the matrix columns in your application.
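A rough sketch of what that encoding could look like, packing each row's (column, column value) pairs into one binary value. The layout (unsigned 32-bit column index plus 64-bit float) and the function names are my assumptions; the question only says the cells are decimal numbers, so pick a value type that matches your precision needs.

```python
import struct

# Assumed layout: uint32 column index + float64 value, little-endian,
# no padding, so 12 bytes per nonzero cell.
PAIR = struct.Struct("<Id")


def encode_row(cells):
    """Turn {column: value} into one bytes blob, sorted by column index."""
    return b"".join(PAIR.pack(col, val) for col, val in sorted(cells.items()))


def decode_row(blob):
    """Turn the blob back into {column: value}."""
    return {col: val for col, val in PAIR.iter_unpack(blob)}


blob = encode_row({7: 0.5, 2: 3.25})  # this is the single value stored per row
cells = decode_row(blob)
```

The DB then only sees one opaque value per row key; the decoding step is the extra application-side processing mentioned above.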

Which DB? I don't want to pick a name; I would be starting an opinion war. Please do this research separately.

Related Solutions

A Key/Value store database

Are you familiar with the concept of a key/value pair? Presuming you're familiar with Java or C#, this exists in the language as a map/hash/datatable/KeyValuePair (the last is in the case of C#).

The way it works is demonstrated in this little sample chart:

Color        Red
Age          18
Size         Large
Name         Smith
Title        The Brown Dog

Where you have a key (left) and a value (right) ... notice it can be a string, int, or the like. Most KVP objects allow you to store any object on the right, because it's just a value.

Since you'll always have a unique key for a particular object that you want to return, you can just query the database for that unique key and get the results back from whichever node has the object. This is why it's good for distributed systems, although there are other things involved, like polling until the first n nodes return a value that matches what the other nodes return.

Now my example above is very simple, so here's a slightly better version of the KVP:

user1923_color    Red
user1923_age      18
user3371_color    Blue
user4344_color    Brackish
user1923_height   6' 0"
user3371_age      34

So, as you can see, the simple key generation is to put "user", the user's unique number, an underscore and the attribute name. Again, this is a simple variation, but I think we begin to understand that so long as we can define the part on the left and have it be consistently formatted, we can pull out the value.
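In code, that key scheme might look like the sketch below. A plain dict stands in for the store, and make_key is a hypothetical helper, not part of any real library.

```python
def make_key(user_id, attribute):
    """Build a key such as 'user1923_color' (hypothetical helper)."""
    return f"user{user_id}_{attribute}"


kv = {}  # a dict standing in for the key-value store
kv[make_key(1923, "color")] = "Red"
kv[make_key(1923, "age")] = 18
kv[make_key(3371, "color")] = "Blue"

# Lookups rebuild the same key; there is no query language involved.
color = kv[make_key(1923, "color")]
missing = kv.get(make_key(4344, "age"))  # None: that key was never stored
```

As long as readers and writers format the key the same way, the value can always be found again.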

Notice that there's no restriction on the key value (ok, there can be some limitations, such as text-only) or on the value property (there may be a size restriction) but so far I've not had really complex systems. Let's try and go a little further:

app_setting_width      450
user1923_color         Red
user1923_age           18
user3371_color         Blue
user4344_color         Brackish
user1923_height        6' 0"
user3371_age           34
error_msg_457          There is no file %1 here
error_message_1        There is no user with %1 name
1923_name              Jim
user1923_name          Jim Smith
user1923_lname         Smith
Application_Installed  true
log_errors             1
install_path           C:\Windows\System32\Restricted
ServerName             localhost
test                   test
test1                  test
test123                Brackish
devonly
wonderwoman
value                  key

You get the idea... all those would be stored in one massive "table" on the distributed nodes (there's math behind it all) and you would just ask the distributed system for the value you need by name.
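As a very rough idea of the "math" part: one simple way to decide which node holds a key is to hash the key and take the result modulo the number of nodes. This is only a sketch under that assumption; the node names are made up, and real distributed stores usually use more elaborate schemes so that adding a node does not move most of the keys.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names


def node_for(key, nodes=NODES):
    """Map a key to one node using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then pick a node.
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


owner = node_for("user1923_color")  # the same key always maps to the same node
```

Any client that knows the node list can compute where a key lives without asking a central directory.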

At the very least, that's my understanding of how it all works. I may have a few things wrong, but that's the basics.

Obligatory Wikipedia link: http://en.wikipedia.org/wiki/Associative_array

Fastest key-value store for random disk reads

Without context, this is a poor question.

Bound to a single machine, your requirement is a function of IO performance, not platform. An Access mdb file on a FusionIO card could outperform Trinity on a 5400rpm drive in a narrow band of tests.

You'll have to be more specific if you want answers of any value.

Edit: following comment.

Context would be a description of what you're building. As I indicated, whichever k-v system you choose, you will be IO bound when constrained to a single machine. On EC2 block storage the choice of k-v becomes even more irrelevant.

If you're building on EC2, look at the native products they already provide, e.g. SimpleDB or ElastiCache.
