MongoDB – How do databases store index key values (on-disk) for variable-length fields?

couchdb, index, mongodb, nosql

Context

This question concerns the low-level implementation details of indexes in both SQL and NoSQL database systems. The actual structure of the index (B+ tree, hash, SSTable, etc.) is irrelevant here, since the question is specifically about the keys stored inside a single node of any of those implementations.

Background

In SQL (e.g. MySQL) and NoSQL (CouchDB, MongoDB, etc.) databases, when you build an index on a column or JSON document field, what you are essentially asking the database to do is create a sorted list of all those values, each paired with a file offset into the main data file where the record pertaining to that value lives.

(For simplicity's sake, I may be hand-waving away other esoteric details of specific implementations.)

Simple Classic SQL Example

Consider a standard SQL table with a simple 32-bit int primary key on which we create an index. We will end up with an on-disk index of the integer keys, sorted, each associated with a 64-bit offset into the data file where the record lives, e.g.:

id   | offset
--------------
1    | 1375
2    | 1413
3    | 1786

The on-disk representation of the keys in the index looks something like this:

[4-bytes][8-bytes] --> 12 bytes for each indexed value

Sticking to standard rules of thumb about optimizing disk I/O with filesystems and database systems, let's say you store keys in 4KB blocks on disk, which means:

4096 bytes / 12 bytes per key = 341 keys per block

Ignoring the overall structure of the index (B+ tree, hash, sorted list, etc.), we read and write blocks of 341 keys at a time into memory and back out to disk as needed.
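To make the arithmetic concrete, here is a minimal sketch (purely illustrative, not any particular engine's on-disk format) of packing such fixed 12-byte entries into a 4KB block using Python's struct module:

import struct

BLOCK_SIZE = 4096
ENTRY_FMT = ">IQ"                          # 4-byte unsigned id + 8-byte unsigned offset
ENTRY_SIZE = struct.calcsize(ENTRY_FMT)    # 12 bytes
KEYS_PER_BLOCK = BLOCK_SIZE // ENTRY_SIZE  # 341

def pack_block(entries):
    # entries: a sorted list of (id, data_file_offset) tuples
    assert len(entries) <= KEYS_PER_BLOCK
    block = bytearray(BLOCK_SIZE)
    for i, (key, offset) in enumerate(entries):
        struct.pack_into(ENTRY_FMT, block, i * ENTRY_SIZE, key, offset)
    return bytes(block)

block = pack_block([(1, 1375), (2, 1413), (3, 1786)])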

Example Query

Using the information from the previous section, let's say a query comes in for "id=2". The classic DB index lookup goes as follows:

  1. Read the root of the index (in this case, 1 block)
  2. Binary-search the sorted block to find the key
  3. Get the data file offset from the value
  4. Look up the record in the data file using the offset
  5. Return the data to the caller

Question Setup…

Ok, here is where the question comes together…

Step #2 is the most important part that allows these queries to execute in O(log n) time… the information has to be sorted, BUT you also have to be capable of traversing the list in a binary-search manner… more specifically, you have to be able to jump to well-defined offsets at will to read in the index key value at that position.

After reading in the block, you have to be able to jump immediately to, say, the 170th position (the middle of the block), read the key value, and see whether what you are looking for is GT or LT that value (and so on and so on…).

The only way you would be able to jump around the data in the block like that is if the key sizes were all well-defined, like our example above (4 bytes then 8 bytes per key).
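As a sketch of that jump-to-any-position behaviour under the same hypothetical fixed 12-byte entry layout from above: the i-th key always lives at i * 12, so a plain binary search over the raw block works.

import struct

ENTRY_FMT = ">IQ"                         # same hypothetical 4-byte id + 8-byte offset layout
ENTRY_SIZE = struct.calcsize(ENTRY_FMT)   # 12 bytes

def search_block(block, num_keys, wanted_id):
    lo, hi = 0, num_keys - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        key, offset = struct.unpack_from(ENTRY_FMT, block, mid * ENTRY_SIZE)
        if key == wanted_id:
            return offset                 # the data-file offset for this record
        elif key < wanted_id:
            lo = mid + 1
        else:
            hi = mid - 1
    return None                           # key is not in this block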

QUESTION

Ok, so here is where I am getting stuck with efficient index design… for varchar columns in SQL databases, or more specifically for totally free-form fields in document databases like CouchDB and other NoSQL stores, where any field you want to index can be any length, how do you implement the key values inside the blocks of the index structure you build your indices out of?

For example, let's say you use a sequential counter for an ID in CouchDB and you are indexing tweets… you'll have values that go from "1" to "100,000,000,000" after a few months.

Let's say you build the index on day 1, when there are only 4 tweets in the database. CouchDB might be tempted to use the following construct for the key values inside the index blocks:

[1-byte][8-bytes] <-- 9 bytes
4096 / 9 = 455 keys per block

At some point this breaks and you need a variable number of bytes to store your key value in the indexes.

The point is even more glaring if you decide to index a truly variable-length field like a "tweet_message" or something.

With the keys themselves being totally variable in length, and the database having no way to intelligently guess at some "maximum key size" when the index is created and updated, how are these keys actually stored inside the blocks representing segments of the indices in these databases?

Obviously, if your keys are variably sized and you read in a block of keys, not only do you have no idea how many keys are actually in the block, you also have no idea how to jump to the middle of the list to do a binary search on them.

This is where I am getting all tripped up.

With statically typed fields in classic SQL databases (like bool, int, char, etc.) I understand the index can just pre-define the key length and stick to it… but in this world of document data stores, I am perplexed as to how they efficiently model this data on disk such that it can still be searched in O(log n) time, and I would appreciate any clarification here.

Please let me know if any clarifications are needed!

Update (Greg's Answer)

Please see my comments attached to Greg's answer. After another week of research I think he has really stumbled upon a wonderfully simple and performant suggestion that in practice is dead easy to implement and use, while providing big performance wins by avoiding the deserialization of key values you don't care about.

I have looked into 3 separate DBMS implementations (CouchDB, kivaloo and InnoDB) and all of them handle this issue by deserializing the entire block into an internal data structure before searching the values inside their execution environment (Erlang/C).

This is what I think is so brilliant about Greg's suggestion: a normal block size of 2048 would typically hold 50 or fewer offsets, resulting in a very small block of numbers that would need to be read in.

Update (Potential Downsides to Greg's Suggestion)

In order to best continue this dialog with myself, I realized the following downsides to this…

  1. If every "block" is headed with offset data, you couldn't allow the block size to be adjusted in the configuration later down the road, as you might end up reading in data that didn't start with a header correctly, or a block that contained multiple headers.

  2. If you are indexing huge key values (say someone is trying to index a column of char(8192) or blob(8192)) it is possible that keys do not fit in a single block and need to be overflowed across two blocks side by side. This means your first block would have an offset header and the second block would just immediately begin with key data.

The solution to all of this is having a fixed database block size that is not adjustable, and developing header block data structures around it… for example, you fix all block sizes to 4KB (typically the most optimal anyway) and write a very small block header that includes the "block type" at the beginning. If it's a normal block, then immediately after the block header should be the offset header. If it's an "overflow" type, then immediately after the block header is raw key data.
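As a sketch of that idea (the field widths and type codes here are arbitrary choices for illustration, not any real engine's format), a reader can branch on the block-type byte before interpreting the rest of the block:

import struct

BLOCK_SIZE = 4096
HEADER_FMT = ">BH"                      # hypothetical: 1-byte block type + 2-byte entry count
HEADER_SIZE = struct.calcsize(HEADER_FMT)
BLOCK_NORMAL, BLOCK_OVERFLOW = 0, 1

def read_block(block):
    block_type, entry_count = struct.unpack_from(HEADER_FMT, block, 0)
    if block_type == BLOCK_NORMAL:
        # normal block: the offset header follows immediately after the block header
        offsets = struct.unpack_from(">%dH" % entry_count, block, HEADER_SIZE)
        return ("normal", offsets)
    else:
        # overflow block: everything after the block header is raw key data
        return ("overflow", block[HEADER_SIZE:])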

Update (Potential awesome up-side)

After the block is read in as a series of bytes and the offsets are decoded, technically you could simply encode the key you are searching for to raw bytes and then do direct comparisons on the byte stream.

Once the key you are looking for is found, the pointer can be decoded and followed.

Another awesome side-effect of Greg's idea! The potential for CPU time optimization here is big enough that setting a fixed block size might be worth it just to gain all of this.
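A sketch of what that byte-level comparison could look like, assuming the keys are stored in an encoding whose byte order matches the sort order of the decoded values (e.g. UTF-8 text), and that the block's offset table has already been read:

def find_key(block, offsets, search_key_bytes):
    # binary search comparing raw byte slices; keys are never deserialized
    # (assumes the key data runs to the end of the block; a padded block would
    # also need to record where the key data stops)
    lo, hi = 0, len(offsets) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        start = offsets[mid]
        end = offsets[mid + 1] if mid + 1 < len(offsets) else len(block)
        key_bytes = block[start:end]
        if key_bytes == search_key_bytes:
            return mid                    # index of the matching key
        elif key_bytes < search_key_bytes:
            lo = mid + 1
        else:
            hi = mid - 1
    return None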

Best Answer

You can store your index as a list of fixed-size offsets into the block containing your key data. For example:

+--------------+
| 3            | number of entries
+--------------+
| 16           | offset of first key data
+--------------+
| 24           | offset of second key data
+--------------+
| 39           | offset of third key data
+--------------+
| key one |
+----------------+
| key number two |
+-----------------------+
| this is the third key |
+-----------------------+

(well, the key data would be sorted in a real example, but you get the idea).

Note that this does not necessarily reflect how index blocks are actually constructed in any database. This is merely an example of how you might organise a block of index data where the key data is of variable length.
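For what it's worth, a minimal sketch of serializing and reading back a block in that shape (using hypothetical 4-byte count and offset fields; a real engine would pick its own widths) might look like:

import struct

def build_block(keys):
    # keys: a sorted list of bytes objects (the variable-length key data)
    header_size = 4 * (1 + len(keys))        # entry count + one offset per key
    offsets, body, pos = [], b"", header_size
    for k in keys:
        offsets.append(pos)
        body += k
        pos += len(k)
    header = struct.pack("<%dI" % (1 + len(keys)), len(keys), *offsets)
    return header + body

def read_offsets(block):
    (count,) = struct.unpack_from("<I", block, 0)
    return struct.unpack_from("<%dI" % count, block, 4)

blk = build_block([b"key one", b"key number two", b"this is the third key"])
print(read_offsets(blk))   # (16, 23, 37) with these widths; exact numbers depend on the encoding chosen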