The linked info explains the basic issue (there are performance differences), but it's not simple enough to say that one is always better than the other (otherwise, there'd be no reason to have both). Also, in MyISAM, the 64k max size for VARCHAR isn't per field -- it's per record.
Basically, there are four ways to store strings in database records:
- fixed length
- C-style strings (marked with a NUL or similar terminator character at the end of the string)
- Pascal style strings (a few bytes to indicate length, then the string)
- Pointers (store the string somewhere else)
MyISAM uses something similar to #3 for VARCHAR, and a hybrid approach for TEXT where it stores the beginning of the string in the record, then the rest of the string somewhere else.
InnoDB is similar for VARCHAR, but stores the complete TEXT field outside of the record.
With #1 and #4, the stuff in the record is always the same length, so it's easier to skip over if you don't need the string but need stuff after it. Both #2 and #3 aren't too bad for short strings ... #2 has to keep looking for the marker, while #3 can skip ahead ... as the strings get longer, #2 gets worse for this particular use case.
If you actually need to read the string, #4 is slower, as you have to read the record, then read the string, which might be stored elsewhere on the disk, depending on just how the database handles it. #1 is always pretty straightforward, and again you run into similar issues: #2 gets worse the longer the string is, while #3 is a little worse than #2 for very small strings, but better as it gets longer.
Then there's storage requirements ... #1 is always a fixed length, so it might have bloat if most strings aren't the max length. #2 has 1 extra byte; #3 typically has 1 extra byte if the max length is up to 255, 2 extra bytes for a 64k max. #4 has the pointer length, plus typically the same length prefix as #3.
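To make the mapping to MySQL concrete, here's a hedged sketch (the table and column names are invented for illustration): CHAR behaves like #1, VARCHAR like #3, and TEXT like the hybrid/pointer approaches, with the exact on-disk layout depending on the storage engine. There's no direct equivalent of #2 in MySQL.

```sql
-- Hypothetical table showing how MySQL column types map onto the layouts above.
CREATE TABLE string_layouts (
    id      INT PRIMARY KEY,
    code    CHAR(8),         -- #1: fixed length, space-padded to 8 characters
    title   VARCHAR(100),    -- #3: length prefix (1 byte when max <= 255 bytes) + data
    summary VARCHAR(2000),   -- #3: length prefix (2 bytes when max > 255 bytes) + data
    body    TEXT             -- hybrid/#4: stored partly or fully outside the record
) ENGINE=InnoDB;
```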
For the specific implementations within MySQL 5.1, the docs for MyISAM state:
- Support for a true VARCHAR type; a VARCHAR column starts with a length stored in one or two bytes.
- Tables with VARCHAR columns may have fixed or dynamic row length.
- The sum of the lengths of the VARCHAR and CHAR columns in a table may be up to 64KB.
While for InnoDB:
- The variable-length part of the record header contains a bit vector for indicating NULL columns. If the number of columns in the index that can be NULL is N, the bit vector occupies CEILING(N/8) bytes. (For example, if there are anywhere from 9 to 15 columns that can be NULL, the bit vector uses two bytes.) Columns that are NULL do not occupy space other than the bit in this vector. The variable-length part of the header also contains the lengths of variable-length columns. Each length takes one or two bytes, depending on the maximum length of the column. If all columns in the index are NOT NULL and have a fixed length, the record header has no variable-length part.
- For each non-NULL variable-length field, the record header contains the length of the column in one or two bytes. Two bytes will only be needed if part of the column is stored externally in overflow pages or the maximum length exceeds 255 bytes and the actual length exceeds 127 bytes. For an externally stored column, the two-byte length indicates the length of the internally stored part plus the 20-byte pointer to the externally stored part. The internal part is 768 bytes, so the length is 768+20. The 20-byte pointer stores the true length of the column.
...
As with so many other things when dealing with databases, if you're not sure what's best for your needs, try benchmarking with similar data & usage, and see how the engines behave.
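If you do benchmark, a minimal sketch (all names here are made up; substitute your real schema, data, and queries) is to build the same table under both engines and compare timings with the profiler:

```sql
-- Hypothetical benchmark skeleton: identical schema, two engines.
CREATE TABLE bench_myisam (id INT PRIMARY KEY, val VARCHAR(1000)) ENGINE=MyISAM;
CREATE TABLE bench_innodb (id INT PRIMARY KEY, val VARCHAR(1000)) ENGINE=InnoDB;

-- Load the same representative data into both, then:
SET profiling = 1;
SELECT id  FROM bench_myisam WHERE id BETWEEN 1000 AND 2000;  -- skips over the strings
SELECT val FROM bench_myisam WHERE id BETWEEN 1000 AND 2000;  -- actually reads them
SELECT id  FROM bench_innodb WHERE id BETWEEN 1000 AND 2000;
SELECT val FROM bench_innodb WHERE id BETWEEN 1000 AND 2000;
SHOW PROFILES;  -- rough per-query durations
```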
Some thoughts....
Typically one does not want to store pieces of tightly interrelated information in different systems. The chances of things getting out of sync are significant, and then instead of one problem on your hands you have two. One thing you can do with Mongo, though, is use it to pipeline your data in or out. My preference is to keep everything in PostgreSQL to the extent this is possible. However, I would note that doing so really requires expert knowledge of PostgreSQL programming and is not for shops unwilling to commit to using advanced features. I see a somewhat different set of options than you do; since my preference is not something I see listed, I will give it to you.
You can probably separate your metadata into common data, data required for classes, and document data. In this approach you would have a general catalog table with the basic common information, plus one table per class. Each class table would have an hstore, json, or xml field storing the rest of the data, along with ordinary columns for data that must be significantly constrained. This reduces what you need to put in these tables per class, but still lets you leverage constraints however you like (a rough table sketch follows the three options below). The three options have different issues and are worth considering separately:
hstore is relatively limited but also used by a lot of people. It isn't extremely new, but it is only a flat key/value store, incapable of nested data structures, unlike json and xml.
json is quite new and doesn't really do a lot right now. This doesn't mean you can't do a lot with it, but you aren't going to do a lot out of the box; if you do, expect a significant amount of programming, probably in plv8js or, if you want to stick with older environments, plperlu or plpython. json is better supported in 9.3, at least in current development snapshots, so when that version is released things will get better.
xml is the best supported of the three, with the most features, and the longest support history. Then again, it is XML.....
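As a rough sketch of the layout described above (every name here is invented for illustration; the hstore variant is shown, and it assumes the hstore extension is available):

```sql
-- Hypothetical catalog table holding the common metadata.
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE doc_catalog (
    doc_id     serial PRIMARY KEY,
    title      text NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- One table per class: strongly constrained columns plus a free-form hstore field.
CREATE TABLE invoice_docs (
    doc_id      int PRIMARY KEY REFERENCES doc_catalog (doc_id),
    customer_id int NOT NULL,                               -- data you want constrained
    total       numeric(12,2) NOT NULL CHECK (total >= 0),
    extra       hstore                                      -- the rest, as key/value pairs
);

-- Reaching into the loosely structured part:
SELECT doc_id, extra -> 'shipping_method' AS shipping_method
  FROM invoice_docs
 WHERE extra ? 'shipping_method';
```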
However, if you do decide to go with Mongo and PostgreSQL together, note that PostgreSQL supports two-phase commit, meaning you can run the write operations, then issue PREPARE TRANSACTION; if this succeeds, do your atomic writes in Mongo, and if that succeeds, issue COMMIT PREPARED in PostgreSQL.
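A minimal sketch of that flow (the transaction identifier and statements are made up; note that max_prepared_transactions must be set above zero for PREPARE TRANSACTION to work):

```sql
-- In PostgreSQL: do the relational writes, but don't commit yet.
BEGIN;
INSERT INTO doc_catalog (title) VALUES ('new document');
PREPARE TRANSACTION 'doc_sync_12345';  -- survives crashes from this point on

-- Now perform the corresponding writes in MongoDB (outside SQL).
-- If the Mongo writes succeed:
COMMIT PREPARED 'doc_sync_12345';
-- If they fail:
-- ROLLBACK PREPARED 'doc_sync_12345';
```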
Best Answer
Since a query that would utilize the "date" index will implicitly contain criteria for that "date" field (value(s) of certain type(s)), Mongo will decide which index, or which part of the index, to use, making the fact of having multiple types of values in the same field moot (as far as indexing is concerned).
Here's an example. I added four documents to my Mongo 2.4.4 test db: two with ISODate values and two with string values in the "date" field.
Before index creation, any query on the date field (be it with an ISODate or a string criterion) will search through all 4 documents; this can be confirmed with explain(). Once I create the index via ensureIndex("date"), a query of the form date: "some string" or date: ISODate(...), or a query for date in some range of ISODate values, will only look through the pertinent values (2 for each of these queries). This is made even clearer if we add a third type of value for date (numerical), giving a collection of 6 indexed documents.
Doing an $or search for (let's say) ISODate and numerical values (exact match), and then running explain() on that search, will show that only 2 documents are scanned (out of the 6 indexed), showing that having mixed types of values for a field does not lower the efficiency of the index.
Doing a similar $or search with ranges (for both ISODate and numerical) will show that 4 results are obtained after scanning only 4 documents (the string values do not influence the utilization of the index).
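Since the shell output isn't reproduced here, below is a hedged reconstruction of the kind of session described (the collection name and documents are invented, and explain() output fields differ across MongoDB versions):

```javascript
// Hypothetical mongo 2.4 shell session: three types of values in the "date" field.
db.dates.insert({ _id: 1, date: ISODate("2013-01-01") });
db.dates.insert({ _id: 2, date: ISODate("2013-06-01") });
db.dates.insert({ _id: 3, date: "2013-01-01" });
db.dates.insert({ _id: 4, date: "2013-06-01" });
db.dates.insert({ _id: 5, date: 20130101 });
db.dates.insert({ _id: 6, date: 20130601 });

db.dates.ensureIndex({ date: 1 });

// Exact-match $or across two of the three types: explain() should report only the
// matching index entries being scanned, not all six documents.
db.dates.find({ $or: [ { date: ISODate("2013-01-01") }, { date: 20130101 } ] }).explain();

// Range $or across ISODate and numerical values: the string values are not scanned.
db.dates.find({ $or: [
    { date: { $gte: ISODate("2013-01-01"), $lte: ISODate("2013-12-31") } },
    { date: { $gte: 20130101, $lte: 20131231 } }
] }).explain();
```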