NoSQL – What is Unstructured Data?

cassandradatabase-designnosqlsql server

we are currently running at the edge of resources with our mssql server based solution.

We have now many traditional options regarding the next move to tackle the load:

  • buy faster CPUs and IO
  • split some customers to seperate server
  • move db to cluster

All are either expensive in terms of licensing and hardware or time. So, I want to add another option by moving the whole system to a scalable solution that nosql engine cassandra promises.

Yet, I am not sure and not experienced with noSQL databases, so I need to understand the structure of "unstructured" data.

In our application, we basically store data entered by users in various ways as "key-value" lists. There is a parent table, that contains the head element (like an Order) and there is a child table with the key-value pairs comprising the contents of the order (like Order_Lines).

Business-wise, Order and OrderLines are a unit. But due to the RDBMS, they are stored in tables and must be joined all the time.

During operations, we sometimes choose to load only the top part, but most of the time, we load the head row + some KVPs to display some useful information.

For example, in an overview list, we show the head identifier + some values to in columns for each row.

UPDATE: We store forms of any kind. So, basically we store "documents". Nevertheless, we have to prepare and search through these forms by any value, sort etc. Data access control adds another layer of compexity on the database.

As you may guess, the amount and availability of certain KVPs varies from object to object. There is no valid possibility to create single tables for each kind of object as we would have to create thousands of tables for the different data combinations.

Would this kind of "Dictionary" like datasets be better stored in a noSQL database? And will we have performance benefits from this?
Would cassandra model these head+KVPs as one dataset? Looking at the cassandra webpage and some tutorials, I have the impression, that there is not so much of a difference between our RDBMS and cassandra in terms of data organisation – leaving us with the same huge amount of joins if you wanted to select 5 KVPs for a list for each row.

Enlightenment is welcome, also pointers to papers explaining the issues are ok.

Best Answer

There are a couple of concepts which need to be distinguished. One is about structure and the other about schema.

Structured data is one where the application knows in advance the meaning of each byte it receives. A good example is measurements from a sensor. In contrast a Twitter stream is unstructured. Schema is about how much of the structure is communicated to the DBMS as how it is asked to enforce this. It controls how much the DBMS parses the data it stores. A schema-required DBMS such as SQL Server can store unparsed data (varbinary) or optionally-parsed data (xml) and fully parsed data (columns).

NoSQL DBMSs lie on a spectrum from no parsing (key-value stores) upwards. Cassandra offers reatively rich functionality in this respect. Where they differ markedly to relational stores is in the uniformity of the data. Once a table is defined only data which matches that definition may be held there. In Cassandra, however, even if columns and families are defined there is no requirement for any two rows in the same table to look anything like each other. It falls to the application designer to decide how much goes in a single row (also referred to as a document) and what is held separately, linked by pointers. In effect, how much denormalisation do you want.

The advantage is you can retrieve a full set of data with a single sequential read. This is fast. One downside is that you, the application programer, are now solely responsible for all data integrity and backward compatibility concerns, for ever, for every bit of code that ever touches this data store. That can be difficult to get right. Also, you are locked into one point of view on the data. If you key your rows by order number, how do you report on the sale on one particular product, or region, or customer?