PostgreSQL – Using MongoDB and PostgreSQL together

Tags: document-oriented, mongodb, postgresql, rdbms

My current project is essentially a run of the mill document management system.

That said, there are some wrinkles (surprise, surprise). While some of
the wrinkles are fairly specific to the project, I believe there are
some general observations and questions that have come up which don't
have a canonical answer (that I could find, anyway) and that are
applicable to a wider problem domain. There's a lot here and I'm not
sure it's a good fit for the StackExchange Q&A format, but I think it is
a) an answerable question and b) non-specific enough that it can benefit
the community. Some of my considerations are specific to me, but I think
the question could be of use to anyone faced with deciding among SQL,
NoSQL, or both.

The background:

The web app we are building contains data that is clearly relational in
nature as well as data that is document-oriented. We would like to
have our cake and eat it too.

TL;DR: I think #5 below passes the smell test. Do you? Does anyone
have experience with such an integration of SQL and NoSQL in a single
application? I tried to list all the possible approaches to this class
of problem below. Have I missed a promising alternative?

Complexities:

  • There are many different classes of documents. The requirements
    already call for dozens of different documents. This number will
    only ever go up. The best possible case would be one in which we
    could leverage a simple domain specific language, code generation
    and a flexible schema so that domain experts could handle the
    addition of new document classes without the intervention of DBAs or
    programmers. (Note: we're already aware we are living out
    Greenspun's Tenth Rule.)
  • The integrity of previous successful writes is a central requirement
    of the project. The data will be business critical. Full ACID
    semantics on writes can be sacrificed provided that the things that
    do get successfully written stay written.
  • The documents are themselves complex. The prototype document in our
    specific case will require storage of 150+ distinct pieces of data
    per document instance. The pathological case could be an order of
    magnitude worse, but certainly not two.
  • A single class of documents is a moving target subject to updates at
    a later point in time.
  • We like the free stuff we get from Django when we hook it into a
    relational database. We would like to keep the freebies without
    having to jump back two Django versions to use the django-nonrel
    fork. Dumping the ORM entirely is preferable to downgrading to 1.3.

Essentially, it's a mishmash of relational data (your typical web app
stuff like users, groups, etc., as well as document metadata that
we'll need to be able to slice and dice with complex queries in
realtime) and document data (e.g. the hundreds of fields which we have
no interest in joining on or querying by – our only use case for the
data will be for showing the single document into which it was
entered).

I wanted to do a sanity check (if you check my posting history, I'm
pretty explicit about the fact that I am not a DBA) on my preferred
method as well as enumerate all of the options I've come across for
others solving broadly similar problems involving both relational and
non-relational data.

Proposed Solutions:

1. One table per document class

Each document class gets its own table, with columns for all metadata and data.
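
For concreteness, a hypothetical per-class model under this scheme might
look like the following Django sketch (the "invoice" class and all of
its field names are invented purely for illustration):

    # One model -- and thus one table -- per document class.
    # "InvoiceDocument" and its fields are hypothetical examples.
    from django.db import models

    class InvoiceDocument(models.Model):
        # Metadata columns, repeated in every document-class table
        created_at = models.DateTimeField(auto_now_add=True)
        author = models.ForeignKey('auth.User', on_delete=models.PROTECT)

        # A handful of the 150+ data columns this class would carry
        customer_name = models.CharField(max_length=200)
        amount_due = models.DecimalField(max_digits=12, decimal_places=2)
        due_date = models.DateField(null=True, blank=True)
        # ...one column per field, duplicated across dozens of classes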

Advantages:

  • The standard SQL data model is in play.
  • Relational data is handled in the best possible way. We'll
    denormalize later if we need to.
  • Django's built-in admin interface is comfortable with introspecting
    these tables, and the ORM can live happily with 100% of the data out
    of the box.

Disadvantages:

  • Maintenance nightmare. Dozens (hundreds?) of tables with (tens of?)
    thousands of columns.
  • Application-level logic responsible for deciding exactly which table
    to write to. Making the table name a parameter for a query stinks.
  • Basically all business logic changes will require schema changes.
  • Pathological cases might require striping data for single forms
    across multiple tables (see: What is the maximum number of columns
    in a PostgreSQL table?).
  • We would probably need to go find a real, honest-to-God DBA who
    would no doubt end up hating life and us.

2. EAV modeling

There is just a fields table. Entity-Attribute-Value modeling is already
well understood. I've included it for completeness. I don't think any
new project being started in 2013 would go with an EAV approach on
purpose.
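
Also for completeness, a minimal sketch of the shape EAV takes (all
names below are invented for illustration):

    # Minimal EAV layout: every field of every document becomes one row
    # in a generic table. All names here are illustrative.
    from django.db import models

    class Document(models.Model):
        doc_class = models.CharField(max_length=100)  # e.g. "invoice"
        created_at = models.DateTimeField(auto_now_add=True)

    class Field(models.Model):
        document = models.ForeignKey(Document, on_delete=models.CASCADE,
                                     related_name='fields')
        attribute = models.CharField(max_length=100)  # field name
        value = models.TextField()                    # everything is text

    # Reassembling one app-level object means a second query per document:
    # {f.attribute: f.value for f in doc.fields.all()}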

Advantages:

  • Easy to model.

Disadvantages:

  • More difficult to query.
  • The DB layer no longer has a straightforward representation of what
    constitutes one app-level object.
  • We would lose DB-level constraint checking.
  • The number of rows in one table will grow 100s–1000s of times
    faster. Likely a future pain point, performance-wise.
  • Limited indexing possible.
  • DB schema is nonsensical as far as ORM is concerned. Batteries
    included web app stuff is preserved but custom data models are going
    to require custom queries.

3. Use PostgreSQL hstore or json fields

Either of these field types would do the trick for storing schemaless
data within the context of a relational DB. The only reason I don't
jump to this solution immediately is that it is relatively new
(introduced in version 8.4, so not that new), I have zero previous
exposure to it, and I am suspicious. It strikes me as wrong for precisely the same
reasons I would feel uneasy throwing all my nice, easily normalized
data into Mongo – even though Mongo can handle references between
documents.
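
That said, the mechanics look simple enough. Here is a minimal hstore
round-trip with psycopg2 (a sketch only: the table and its columns are
hypothetical, and it assumes CREATE EXTENSION hstore has been run):

    # Minimal hstore round-trip with psycopg2; assumes a table like
    #   CREATE TABLE documents (id serial PRIMARY KEY, data hstore);
    import psycopg2
    import psycopg2.extras

    conn = psycopg2.connect("dbname=myapp")
    psycopg2.extras.register_hstore(conn)  # adapt Python dicts <-> hstore

    with conn, conn.cursor() as cur:
        # hstore is untyped: every key and value is stored as a string.
        cur.execute("INSERT INTO documents (data) VALUES (%s) RETURNING id",
                    ({'customer_name': 'Acme', 'amount_due': '1299.00'},))
        doc_id = cur.fetchone()[0]

        # Individual keys are extracted with ->; containment (@>) and
        # key-existence queries can additionally use GiST/GIN indexes.
        cur.execute("SELECT data -> 'customer_name' FROM documents"
                    " WHERE id = %s", (doc_id,))
        print(cur.fetchone()[0])  # 'Acme'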

Advantages:

  • We get the benefits of the Django ORM and the built-in auth and
    session management.
  • Everything stays in one backend that we've previously used on other
    projects successfully.

Disadvantages:

  • No experience with this, personally.
  • These don't look like very widely used features. They get
    recommended quite a bit to people looking at NoSQL solutions, but I
    don't see much evidence that they are actually being chosen. This
    makes me think I must be missing something.
  • All values stored in an hstore are strings, so we lose DB-level
    constraint checking.
  • The data in the hstore will never be displayed to the user unless
    they specifically view a document, but the metadata stored in more
    standard columns will be. We will be beating that metadata up, and
    I worry that the rather large hstores we will be creating might
    come with performance drawbacks.

4. Go full bore document-oriented

Make all the things documents (in the MongoDB sense). Create a single
collection of type Document and call it a day. Bring all peripheral
data (including data on user accounts, groups, etc) into mongo as
well. This solution is obviously better than EAV modeling, but it feels
wrong to me for the same reason #3 felt wrong: both feel like using
your hammer as a screwdriver.

Advantages:

  • No need to model data upfront. Have one collection with documents of
    type Document and call it a day.
  • Known good scaling characteristics, should the collection need to
    grow to encompass millions or even billions of documents.
  • JSON format (BSON) is intuitive for developers.
  • As I understand it (which is only vaguely at this point), by being
    paranoid with regard to the write-concern level, even a single
    instance can provide pretty strong data safety against anything and
    everything up to a hard drive crash (see the sketch after this
    list).
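
The write-concern knob mentioned above looks roughly like this from
Python (a sketch in the pymongo 3.x style; the database, collection,
and document contents are invented):

    # Journaled, acknowledged writes in pymongo. w=1 waits for the
    # primary's acknowledgement; j=True waits for the on-disk journal.
    from pymongo import MongoClient
    from pymongo.write_concern import WriteConcern

    client = MongoClient("mongodb://localhost:27017")
    docs = client.myapp.get_collection(
        "documents", write_concern=WriteConcern(w=1, j=True))

    result = docs.insert_one({"doc_class": "invoice",
                              "customer_name": "Acme"})
    print(result.inserted_id)  # acknowledged only after the journal flush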

Disadvantages:

  • The ORM is out the window for Django trunk. Freebies that go out the
    window with it: the auth framework, the sessions framework, the
    admin interface, surely many other things.
  • Must either use mongo's referencing capabilities (which require
    multiple queries) or denormalize data. Not only do we lose freebies
    that we got from Django, we also lose freebies like JOINs we took
    for granted in PostgreSQL.
  • Data safety. When one reads about MongoDB, it seems there is always
    at least one person referring to how it will up and lose your
    data. They never cite a particular occurrence, and it might all just
    be hogwash or just related to the old default fire-and-forget
    write concern, but it still worries me. We will of course be
    utilizing a fairly paranoid backup strategy in any case (though if
    data is corrupted silently, backups could well be immaterial).

5. PostgreSQL and MongoDB

Relational data goes in the relational database and document data goes
in the document-oriented database. The documents table on the
relational database contains all of the data we might need to index or
slice and dice on as well as a MongoDB ObjectId which we would use
when we needed to query for the actual values of the fields on the
documents. We wouldn't be able to use the ORM or the built-in admin
for the values of the documents themselves, but that's not that big of
a loss, since the whole app is basically an admin interface for the
documents, and we would likely have had to customize that specific part
of the ORM to an unacceptable degree to make it work just the way we
need.
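
The read path under this design would look something like the sketch
below (the Django model, collection, and field names are all
hypothetical):

    # Two sequential lookups: metadata plus ObjectId from PostgreSQL,
    # then the field values from MongoDB. All names are illustrative.
    from bson.objectid import ObjectId
    from pymongo import MongoClient

    from myapp.models import DocumentMeta  # hypothetical model holding an
                                           # indexed `mongo_id` CharField

    mongo = MongoClient("mongodb://localhost:27017")
    doc_values = mongo.myapp.documents

    def fetch_document(pk):
        # Query 1: relational metadata, fully sliceable via the ORM
        meta = DocumentMeta.objects.get(pk=pk)
        # Query 2: the document body, keyed by the stored ObjectId
        body = doc_values.find_one({"_id": ObjectId(meta.mongo_id)})
        return meta, body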

Advantages:

  • Each backend does only what it is good at.
  • References between models are preserved without requiring multiple
    queries.
  • We get to keep the batteries Django gave us as far as users,
    sessions, etc are concerned.
  • Only need one documents table no matter how many different classes
    of documents are created.
  • The less often queried document data is strongly separated from the
    far more often queried metadata.

Disadvantages:

  • Retrieving document data will require two sequential queries, first
    against the SQL DB and then against MongoDB (though this is no
    worse than if the same data had been stored in Mongo and not
    denormalized).
  • Writing will no longer be atomic. A write against a single Mongo
    document is guaranteed to be atomic, and PG obviously can make
    atomicity guarantees, but ensuring atomicity of writes across both
    will require application logic, no doubt with a performance and
    complexity penalty.
  • Two backends = two query languages = two different programs with
    dissimilar admin requirements = two databases vying for memory.

Best Answer

Some thoughts....

Typically one does not want to store pieces of tightly interrelated information in different systems. The chances of things getting out of sync are significant, and then instead of one problem on your hands you have two. One thing you can do with Mongo, though, is use it to pipeline your data in or out. My preference is to keep everything in PostgreSQL to the extent this is possible. However, I would note that doing so really requires expert knowledge of PostgreSQL programming and is not for shops unwilling to commit to using advanced features. I see a somewhat different set of options than you do. Since my preference is not something I see listed, I will give it to you.

You can probably separate your metadata into common data, data required for classes, and document data. In this regard you would have a general catalog table with the basic common information, plus one table per class. In each per-class table you would have an hstore, json, or xml field which would store the rest of the data, along with ordinary columns for data that must be constrained significantly (a sketch follows below). This would reduce what you need to put in these tables per class, but would allow you to leverage constraints however you like. The three options have different issues and are worth considering separately:
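
As a concrete sketch of that layout (every name below is invented for
illustration, and it assumes the hstore variant):

    # Catalog table plus one thin per-class table: heavily constrained
    # data gets real columns, the rest lands in an hstore.
    import psycopg2

    DDL = """
    CREATE EXTENSION IF NOT EXISTS hstore;

    CREATE TABLE document_catalog (
        id         serial PRIMARY KEY,
        doc_class  text NOT NULL,
        created_at timestamptz NOT NULL DEFAULT now()
    );

    -- One of these per document class
    CREATE TABLE invoice_document (
        id         int PRIMARY KEY REFERENCES document_catalog (id),
        amount_due numeric(12,2) NOT NULL CHECK (amount_due >= 0),
        extra      hstore
    );
    """

    with psycopg2.connect("dbname=myapp") as conn, conn.cursor() as cur:
        cur.execute(DDL)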

hstore is relatively limited but also used by a lot of people. It isn't extremely new, but it is only a flat key/value store, incapable of nested data structures, unlike json and xml.

json is quite new and doesn't really do a lot right now. This doesn't mean you can't do a lot with it, but you aren't going to get a lot out of the box. If you go this route, you can expect to do a significant amount of programming, probably in plv8js or, if you want to stick with older environments, plperlu or plpython. json is better supported in 9.3, though, at least in current development snapshots, so when that version is released things will get better.
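
By way of example, the operators arriving in 9.3 make simple extraction
possible without any procedural code (a sketch; the documents table and
its json data column are hypothetical):

    # Pulling keys out of a json column with ->> (returns text),
    # available from PostgreSQL 9.3 on. Names are illustrative.
    import psycopg2

    with psycopg2.connect("dbname=myapp") as conn, conn.cursor() as cur:
        cur.execute("""
            SELECT data ->> 'customer_name'
            FROM documents
            WHERE data ->> 'doc_class' = %s
        """, ("invoice",))
        print(cur.fetchall())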

xml is the best supported of the three, with the most features, and the longest support history. Then again, it is XML.....

However, if you do decide to go with Mongo and PostgreSQL together, note that PostgreSQL supports two-phase commit. This means you can run the write operations, then issue PREPARE TRANSACTION, and if this succeeds, do your atomic write in Mongo. If that succeeds, you can then COMMIT PREPARED in PostgreSQL (sketched below).
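
A rough sketch of that sequencing from Python, using psycopg2's
two-phase-commit API together with pymongo (all names are illustrative;
real code would also need recovery logic for in-doubt transactions, and
PostgreSQL must be configured with max_prepared_transactions > 0):

    # PREPARE TRANSACTION in PostgreSQL, write to Mongo, then COMMIT
    # PREPARED -- via psycopg2's tpc_* wrappers. Names are illustrative.
    import psycopg2
    from pymongo import MongoClient

    pg = psycopg2.connect("dbname=myapp")
    mongo = MongoClient("mongodb://localhost:27017")

    xid = pg.xid(0, "doc-write-42", "myapp")  # a globally unique id
    pg.tpc_begin(xid)
    try:
        with pg.cursor() as cur:
            cur.execute("INSERT INTO document_catalog (doc_class)"
                        " VALUES (%s)", ("invoice",))
        pg.tpc_prepare()   # issues PREPARE TRANSACTION
        mongo.myapp.documents.insert_one({"doc_class": "invoice"})
        pg.tpc_commit()    # issues COMMIT PREPARED
    except Exception:
        pg.tpc_rollback()  # ROLLBACK PREPARED (or plain ROLLBACK pre-prepare)
        raise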