MySQL – Document Store & Relational DB Hybrid. Exists? Possible?

database-design, mongodb, MySQL, nosql, replication

Does anyone know of any product or open source project that attempts to strap together MongoDB and MySQL, for instance?

I hear a lot about how relational DBs being ACID compliant makes it nearly impossible to shard as easily as in Mongo, and that this is their Achilles heel. The NoSQL fans are surely right: a technology must eventually parallelize or die.

No to NoSchema

However, schemaless DBs are not a sustainable solution to many problems. Schema changes are only hard because they force you to see what you're breaking. Not seeing doesn't mean you're not breaking.

Best of both?

What I need help understanding is why it seems accepted that you can't in theory have the advantages of both. Can't a relational DB be engineered to compromise with a wise strategy of eventual consistency? I understand why it's more difficult than with documents – a single fetch is composed of records stored physically all over the disk. But how about a document store which acts as a front for a relational DB?

Imagine

The DB has a built-in ORM to map from relational to document collections, and you interface with the documents in your app. Essentially, the document collections are like materialised views with write-back capability. By default the relational DB part is ACID compliant and ensures that non-acidic documents are eventually consistent. Cross-record schema constraints may not be performant enough, but most other schema controls could be.

When you query a large batch of documents, if a few are out of sync – because another document sharing the same denormalised data has just been written to – then they are silently refreshed live from the underlying DB by default. If blocking is an issue then you can choose to handle those dirty docs with a number of other looser consistency options.
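To make this concrete, below is a minimal sketch of the idea, assuming nothing beyond the Python standard library: SQLite stands in for the relational DB, an `order_docs` table plays the part of the document collection, and the table and helper names (`build_doc`, `read_doc`, `rename_customer`) are invented purely for illustration. A relational write marks documents that share its denormalised data as stale, and a read silently rebuilds any stale document from the underlying tables.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders    (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    -- the "document collection": one cached JSON body per order plus a dirty flag
    CREATE TABLE order_docs (order_id INTEGER PRIMARY KEY, body TEXT, stale INTEGER DEFAULT 1);
""")

def build_doc(order_id):
    """Re-denormalise one order document from the relational source of truth."""
    row = conn.execute("""
        SELECT o.id, o.total, c.id, c.name
        FROM orders o JOIN customers c ON c.id = o.customer_id
        WHERE o.id = ?""", (order_id,)).fetchone()
    doc = {"order_id": row[0], "total": row[1],
           "customer": {"id": row[2], "name": row[3]}}
    conn.execute("INSERT OR REPLACE INTO order_docs VALUES (?, ?, 0)",
                 (order_id, json.dumps(doc)))
    return doc

def read_doc(order_id):
    """Return the cached document, silently refreshing it if it is dirty."""
    row = conn.execute("SELECT body, stale FROM order_docs WHERE order_id = ?",
                       (order_id,)).fetchone()
    if row is None or row[1]:
        return build_doc(order_id)   # refresh live from the underlying DB
    return json.loads(row[0])

def rename_customer(customer_id, new_name):
    """An ACID write against the base tables that marks shared documents dirty."""
    with conn:
        conn.execute("UPDATE customers SET name = ? WHERE id = ?",
                     (new_name, customer_id))
        conn.execute("""UPDATE order_docs SET stale = 1
                        WHERE order_id IN (SELECT id FROM orders WHERE customer_id = ?)""",
                     (customer_id,))

conn.execute("INSERT INTO customers VALUES (1, 'Acme Ltd')")
conn.execute("INSERT INTO orders VALUES (10, 1, 99.0)")
print(read_doc(10))              # built on first read
rename_customer(1, "Acme plc")   # denormalised data is now out of date
print(read_doc(10))              # silently rebuilt with the new name
```

In a real product the staleness tracking and rebuild would live inside the engine rather than in application code, but the read-repair flow is the same.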

Pros and pros

...and small cons

You don't get the document store advantage of fast writes, but:

Gets NoSQL advantage: massive and fast reads, easy sharding and no O-R impedance mismatch for front-end developers.

Gets SQL advantage: Organisations can stay nimble, able to work with their data from new angles not thought of initially. Schema helps keep production data from turning into a turgid mess of exceptions before it's too late.

Gets hybrid advantage: Rows can be physically re-organised in the RDB to follow the patterns suggested by the most frequently accessed document collections.
When the load gets too high for the relational DB on one box, this will make sharding more of a possibility (with a possible trade-off against whole-system analytical queries, but that's what you have an OLAP system for). The relational DB will also have far fewer reads than if it were directly user-facing.

I've misunderstood something. I must have.

I don't imagine this is an easy bit of engineering, but it seems within the powers of those who have brought us the products we already have.

I'm not seeking general discussion here but I wouldn't be surprised if there's a big, concrete gotcha I'm missing in this account and I'd like to know what it is.

Best Answer

SQL Server has JSON support, though it seems to be a variation on its XML support.

Postgres has had JSON support since 9.2. Teradata has had it since 2014.

This makes them hybrids. Obviously Postgres is the open-source one.

It really depends on what you need from JSON support. If it is to return the document on a key then that has always been possible without explicit JSON support.

If you want to index certain parts of the JSON document then that does require specific JSON support. There is the question of how sparse the attribute is that you want to index. If it is present in the vast majority of cases (or even a mandatory attribute) then I would break it out of the JSON and have it as an explicit attribute in a hybrid store.
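As a rough, self-contained illustration of both points (returning a document by key versus indexing an attribute inside it), here is a sketch using SQLite's JSON1 functions in place of Postgres or SQL Server syntax, assuming an SQLite build with JSON1 available (the default in recent releases); the `orders` table and `customer_id` attribute are invented for the example.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, doc TEXT)")
conn.execute("INSERT INTO orders VALUES (?, ?)",
             (1, json.dumps({"customer_id": 42, "lines": [{"sku": "A-17", "qty": 3}]})))

# 1. Returning the whole document on its key needs no JSON support at all:
#    the document is just an opaque TEXT value that the application parses.
body = conn.execute("SELECT doc FROM orders WHERE id = ?", (1,)).fetchone()[0]
print(json.loads(body))

# 2. Indexing an attribute inside the document does need JSON support --
#    here an expression index over json_extract (Postgres would index
#    (doc->>'customer_id') instead):
conn.execute("CREATE INDEX idx_orders_cust ON orders (json_extract(doc, '$.customer_id'))")
print(conn.execute(
    "SELECT id FROM orders WHERE json_extract(doc, '$.customer_id') = 42").fetchall())

# 3. If the attribute is present in nearly every document, break it out into an
#    explicit, conventionally indexed column -- the hybrid store described above:
conn.execute("ALTER TABLE orders ADD COLUMN customer_id INTEGER")
conn.execute("UPDATE orders SET customer_id = json_extract(doc, '$.customer_id')")
print(conn.execute("SELECT id FROM orders WHERE customer_id = 42").fetchall())
```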

If by "parallelize or die" you mean "support a distributed dataset or die", then I'm not sure that I agree. For OLTP work you would have to be in a Times Top 100 company to approach the limits of a traditional RDBMS. By definition, most companies just don't generate as much useful data as they would like to think.

Companies like Facebook, Twitter and Netflix are dealing with data at a scale orders of magnitude beyond what most people will ever see.

For high-end web analytics work, then yes, you might want a NoSQL product as a collector. Cassandra is useful in that respect, and it also has tunable consistency.

The big gotcha in distributed systems is Brewer's CAP theorem: you can have at most two of Consistency, Availability and Partition tolerance. It's a bit more blurred than that, but the two-out-of-three rule is generally true.

There's the challenge of how you handle referential integrity and, in some cases, how you honour primary key constraints. Your data might be splashed across many servers, so honouring a foreign key might require a lookup between one server and another. Even if this were a feature, it would seriously impact performance.
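A toy sketch of that cost, with two in-memory SQLite connections standing in for two servers and an application-side check standing in for the foreign key (all names invented for illustration):

```python
import sqlite3

customers_shard = sqlite3.connect(":memory:")   # "server A"
orders_shard = sqlite3.connect(":memory:")      # "server B"

customers_shard.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
customers_shard.execute("INSERT INTO customers VALUES (1, 'Acme')")
orders_shard.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")

def insert_order(order_id, customer_id):
    # the FK check is now a remote lookup: an extra round trip on every write
    exists = customers_shard.execute(
        "SELECT 1 FROM customers WHERE id = ?", (customer_id,)).fetchone()
    if exists is None:
        raise ValueError(f"customer {customer_id} not found on the customers shard")
    orders_shard.execute("INSERT INTO orders VALUES (?, ?)", (order_id, customer_id))

insert_order(10, 1)          # ok, but it took two round trips
try:
    insert_order(11, 99)     # the "foreign key" can only be enforced app-side
except ValueError as e:
    print(e)
```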

If people want to interact with an API that only talks JSON then I see no problem with that, provided there is an underlying DAL that hides whether a data attribute lives inside a JSON document or as an explicit attribute in a hybrid design.
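Such a DAL might look something like the following hypothetical sketch, where callers ask for an attribute by name and never learn whether it is a promoted column or a path inside the JSON blob (table, column and class names are invented for the example):

```python
import json
import sqlite3

PROMOTED = {"customer_id"}   # attributes stored as real columns
JSON_COLUMN = "doc"          # everything else lives in this JSON blob

class OrderDAL:
    def __init__(self, conn):
        self.conn = conn

    def get_attr(self, order_id, attr):
        """Return one attribute of an order, wherever it happens to be stored."""
        if attr in PROMOTED:
            sql = f"SELECT {attr} FROM orders WHERE id = ?"
            return self.conn.execute(sql, (order_id,)).fetchone()[0]
        body = self.conn.execute(
            f"SELECT {JSON_COLUMN} FROM orders WHERE id = ?", (order_id,)).fetchone()[0]
        return json.loads(body).get(attr)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, doc TEXT)")
conn.execute("INSERT INTO orders VALUES (?, ?, ?)",
             (1, 42, json.dumps({"notes": "gift wrap", "priority": "high"})))

dal = OrderDAL(conn)
print(dal.get_attr(1, "customer_id"))   # served from an explicit column
print(dal.get_attr(1, "priority"))      # served from inside the JSON document
```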
