MongoDB – Get latest data for a specific identifier

mongodb, mongodb-4.0, performance, query-performance

I have an application that (mostly) uses only the very latest data available for a certain identifier (may consist of multiple columns/keys). However, these data may be bad or implausible. In these cases, it has to access the history of the entry to verify that value or calculate an alternative.

For the second use case I created a sorted compound index on identifier..., time.
However, in my mind this index does not really fit the main use case, because it still has to consider n values (with different times) for each identifier.

Is there a way in MongoDB to create an index on something like identifier..., $max(time)? Or isn't that necessary at all because MongoDB's internal query optimization is smart enough to only look at the first/last value of the index?
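For what it's worth, a compound index with the time component descending may already serve the main use case: a query that filters on the identifier and sorts by time can stop at the first index entry it touches. A sketch, where the collection and field names (readings, sensor_id, time) are assumptions:

```javascript
// Mongo shell sketch (names assumed):
//   db.readings.createIndex({ sensor_id: 1, time: -1 })
//   db.readings.find({ sensor_id: 1 }).sort({ time: -1 }).limit(1)
// With that index, the sort is satisfied by index order, so only the
// first entry per sensor_id has to be read. The same "start at the
// newest end, stop at the first hit" idea over an in-memory array
// laid out the way the index would be:
const indexOrdered = [ // (sensor_id asc, time desc)
  { sensor_id: 1, time: "12:16", value: 5.15 },
  { sensor_id: 1, time: "12:15", value: 5.12 },
  { sensor_id: 3, time: "12:16", value: 1.23 },
];

function latestFor(sensorId) {
  // The first matching entry in index order is the newest one.
  return indexOrdered.find(d => d.sensor_id === sensorId);
}

console.log(latestFor(1)); // { sensor_id: 1, time: '12:16', value: 5.15 }
```

Running the real query with explain("executionStats") would confirm whether the plan actually walks only those index bounds.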

Environment:

  • MongoDB 4.0.1
  • Java Applications
  • I have plenty of (big) collections that might need this logic (many TB of data in total)
  • New data are added often; roughly one entry per identifier per minute, sometimes multiple entries per second
  • Reads occur at roughly fixed intervals (15 s), each for a fixed subset of ids (only the reader knows which identifiers it uses); only the latest data are relevant
  • Readers might do a cold (re-)start needing to fetch the history to better predict the future/understand the current state (also happens when the reader considers the input data as logically implausible).
  • Network/processing sometimes introduces delays/shuffled input values.

Alternatives considered:

  • Use separate collections: one for live data, one for historical data
    • Decreases write performance as I have to write in two collections
    • Requires some kind of transactional behavior, as I have to make sure that entries will always be added to both collections.
    • Requires additional logic that makes sure that only the latest values are in the live table. Entries might change order or might arrive multiple times due to network/processing latency.
  • Use separate collections with some kind of insert hook
    • Decreases write performance as I have to write in two collections
    • Introduces additional delay due to the hook processing time
    • Requires a new tool and thus increases complexity
    • Requires additional logic that makes sure that only the latest values are in the live table. Entries might change order or might arrive multiple times due to network/processing latency.
    • Sync/hook failures are hard to detect
  • Use of some kind of cache (Redis)
    • Requires a new tool and thus increases complexity
    • Requires some kind of transactional behavior, as I have to make sure that entries will always be added to both tools.
    • Requires additional logic that makes sure that only the latest values are in the live table. Entries might change order or might arrive multiple times due to network/processing latency.
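Should a separate live collection be chosen despite these drawbacks, the "only the latest value wins" requirement can be expressed as a conditional update rather than extra bookkeeping: apply an incoming entry only when its time is newer than what the live document holds. A sketch under assumed names (a live collection keyed by sensor_id):

```javascript
// Mongo shell sketch (names assumed; no upsert, so a brand-new id needs
// a separate insert path):
//   db.live.updateOne(
//     { _id: sensorId, time: { $lt: newTime } },  // match only if newer
//     { $set: { time: newTime, value: newValue } }
//   )
// A late or duplicate entry simply matches no document. The same guard
// shown in memory:
const live = new Map(); // sensor_id -> { time, value }

function applyIfNewer(sensorId, time, value) {
  const cur = live.get(sensorId);
  if (cur === undefined || cur.time < time) {
    live.set(sensorId, { time, value }); // newer: replace
    return true;
  }
  return false; // stale or duplicate: ignore
}

applyIfNewer(1, "12:15", 5.12);
applyIfNewer(1, "12:16", 5.15);
applyIfNewer(1, "12:14", 5.00); // arrives late, ignored
console.log(live.get(1)); // { time: '12:16', value: 5.15 }
```

This makes shuffled or repeated input values harmless on the live side, though the write-amplification and consistency concerns listed above remain.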

EDIT

Example:

sensor_id | time    | value
1         | 12:15   | 5.12 <-- I don't need this value or any of its predecessors
1         | 12:16   | 5.15 <-- I need this value
3         | 12:16   | 1.23 <-- and I need this value
4         | 1998 AD | 4200 <-- Not needed
4         | 1999 AD | 5000 <-- but I need this value (even if it is years old)
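The table above is a "group by sensor_id, keep the newest row" reduction. In MongoDB that can be written as an aggregation of the shape $sort followed by $group with $first (collection name assumed); newer server versions may even be able to answer such a pipeline from a matching index alone. The same reduction applied to the example rows, with the times simplified to sortable numbers:

```javascript
// Mongo shell sketch (collection name assumed):
//   db.readings.aggregate([
//     { $sort: { sensor_id: 1, time: -1 } },
//     { $group: { _id: "$sensor_id",
//                 time:  { $first: "$time" },
//                 value: { $first: "$value" } } }
//   ])
// The same reduction over the example rows:
const rows = [
  { sensor_id: 1, time: 1215, value: 5.12 },
  { sensor_id: 1, time: 1216, value: 5.15 },
  { sensor_id: 3, time: 1216, value: 1.23 },
  { sensor_id: 4, time: 1998, value: 4200 },
  { sensor_id: 4, time: 1999, value: 5000 },
];

function latestPerSensor(rows) {
  const out = new Map(); // sensor_id -> newest row seen so far
  for (const r of rows) {
    const cur = out.get(r.sensor_id);
    if (cur === undefined || cur.time < r.time) out.set(r.sensor_id, r);
  }
  return out;
}

console.log([...latestPerSensor(rows).values()].map(r => r.value)); // [ 5.15, 1.23, 5000 ]
```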

Best Answer

I have an application that (mostly) uses only the very latest data available for a certain identifier (may consist of multiple columns/keys). However, these data may be bad or implausible. In these cases, it has to access the history of the entry to verify that value or calculate an alternative.

For the second use case I created a sorted compound index on identifier..., time. However, in my mind this index does not really fit the main use case, because it still has to consider n values (with different times) for each identifier.

Is there a way in MongoDB to create an index on something like identifier..., $max(time)? Or isn't that necessary at all because MongoDB's internal query optimization is smart enough to only look at the first/last value of the index?

As per the MongoDB blog documentation: By default, MongoDB generates a unique ObjectID identifier that is assigned to the _id field in a new document before writing that document to the database. In many cases the default unique identifiers assigned by MongoDB will meet application requirements. However, in some cases an application may need to create custom unique identifiers, such as:

  • The application may require unique identifiers with a precise number of digits. For example, unique 12-digit identifiers might be required for bank account or credit card numbers.
  • Unique identifiers may need to be generated in a monotonically increasing and continuous sequential order.
  • Unique identifiers may need to be independent of a specific database vendor.

Due to the multi-threaded and distributed nature of modern applications, it is not always a straightforward task to generate unique identifiers that satisfy application requirements.

Use ObjectID as a unique identifier

MongoDB database drivers by default generate an ObjectID identifier that is assigned to the _id field of each document. In many cases the ObjectID may be used as a unique identifier in an application.

ObjectID is a 96-bit number which is composed as follows:

  • a 4-byte value representing the seconds since the Unix epoch (which will not run out of seconds until the year 2106)
  • a 3-byte machine identifier (usually derived from the MAC address)
  • a 2-byte process id, and
  • a 3-byte counter, starting with a random value.
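Because the leading 4 bytes are a big-endian count of seconds since the epoch, the creation time can be recovered from the hex string alone (the example id below is arbitrary, chosen purely for illustration):

```javascript
// Decode the embedded creation time of an ObjectID hex string.
function objectIdToDate(hexId) {
  const seconds = parseInt(hexId.slice(0, 8), 16); // first 4 bytes
  return new Date(seconds * 1000);
}

const when = objectIdToDate("507f1f77bcf86cd799439011");
console.log(when.getUTCFullYear()); // 2012
```

MongoDB drivers expose the same information directly, e.g. ObjectId.getTimestamp() in the mongo shell.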

Benefits

  • ObjectID is automatically generated by the database drivers, and will be assigned to the _id field of each document.
  • ObjectID can be considered globally unique for all practical purposes.
  • ObjectID encodes the timestamp of its creation time, which may be used for queries or to sort by creation time.
  • ObjectID is mostly monotonically increasing.
  • ObjectID is 96-bits, which is smaller than some (eg. 128-bit) alternative UUID implementations, which will result in slightly smaller documents that will use slightly less disk space and RAM than these alternatives.
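Taken together, the last points are why sorting on _id is sometimes used as a cheap stand-in for sorting on insertion time. Since the hex encoding preserves byte order, even a plain string sort illustrates the idea (with the caveat from above: within the same second, ordering is only "mostly" monotonic). The ids below are made up for illustration:

```javascript
// Hex ObjectIDs compare like their underlying bytes, so lexicographic
// order is chronological order down to the embedded second.
const ids = [
  "507f1f77bcf86cd799439011", // timestamp from 2012
  "65a0c0000000000000000001", // timestamp from 2024
  "5b00000000000000000000aa", // timestamp from 2018
];

const newestFirst = [...ids].sort().reverse();
console.log(newestFirst[0]); // 65a0c0000000000000000001
```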

Use a single counter document to generate unique identifiers one at a time

A counter document for unique identifier generation could look as follows:

{
    "_id"   : "UNIQUE COUNT DOCUMENT IDENTIFIER",
    "COUNT" : 0,
    "NOTES" : "Increment COUNT using findAndModify to ensure that the COUNT field will be incremented atomically with the fetch of this document"
}

And the counter document could be atomically requested and incremented as follows. Note that by default the document returned from findAndModify is the pre-modification document:

db.uniqueIdentifierCounter.findAndModify({
    query: { _id: "UNIQUE COUNT DOCUMENT IDENTIFIER" },
    update: {
        $inc: { COUNT: 1 }
    },
    writeConcern: { w: "majority" }
})

Use a single counter document that allocates batches of unique identifiers

This approach is similar to the previous approach, with the difference being that instead of incrementing the COUNT value by 1, we may wish to increment it by a larger number that will represent a batch of unique identifiers that will be allocated by the database to the application.

For example, if the application knows that it needs 1000 new unique identifiers, then the application would use findAndModify() to atomically get the current COUNT and increment the COUNT value by 1000. The document returned from the findAndModify command would contain the starting value for the batch of unique identifiers, and the application would loop over 1000 values from that starting point.

var seq_increment = 1000;
db.uniqueIdentifierCounter.findAndModify({
    query: { _id: "UNIQUE COUNT DOCUMENT IDENTIFIER" },
    update: {
        $inc: { COUNT: seq_increment }
    },
    writeConcern: { w: "majority" }
})
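Since the pre-modification document is returned, the COUNT in that document is the first identifier of the freshly reserved range, and the application hands out start .. start + seq_increment - 1 without further round trips. The consuming side, sketched with an in-memory stand-in for the counter document:

```javascript
// In-memory stand-in for the counter document, to show the hand-out logic.
let counterDoc = { _id: "UNIQUE COUNT DOCUMENT IDENTIFIER", COUNT: 0 };

// Mimics findAndModify's default behaviour: atomically increment,
// return the PRE-modification COUNT.
function allocateBatch(size) {
  const start = counterDoc.COUNT;
  counterDoc.COUNT += size;
  return start; // first identifier of the reserved range
}

const seq_increment = 1000;
const start = allocateBatch(seq_increment);
// start .. start + seq_increment - 1 now belong to this application:
console.log(start, start + seq_increment - 1); // 0 999
```

Identifiers left unused when the application stops are simply skipped, which is usually acceptable; if gap-free sequences are required, this scheme does not fit.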