MongoDB – Get latest data for a specific identifier

mongodb, mongodb-4.0, performance, query-performance

I have an application that (mostly) uses only the very latest data available for a certain identifier (may consist of multiple columns/keys). However, these data may be bad or implausible. In these cases, it has to access the history of the entry to verify that value or calculate an alternative.

For the second use case I created a sorted compound index on identifier..., time.
However, in my mind this index does not really fit the main use case, because it still has to consider n values (with different times) for each identifier.

Is there a way in MongoDB to create an index on something like identifier..., $max(time)? Or isn't that necessary at all because MongoDB's internal query optimization is smart enough to only look at the first/last value of the index?
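For what it's worth, a compound index with the time component descending may already serve the main use case: a query that filters on the identifier and sorts by time can stop at the first index entry it touches. A sketch, where the collection and field names (readings, sensor_id, time) are assumptions:

```javascript
// Mongo shell sketch (names assumed):
//   db.readings.createIndex({ sensor_id: 1, time: -1 })
//   db.readings.find({ sensor_id: 1 }).sort({ time: -1 }).limit(1)
// With that index, the sort is satisfied by index order, so only the
// first entry per sensor_id has to be read. The same "start at the
// newest end, stop at the first hit" idea over an in-memory array
// laid out the way the index would be:
const indexOrdered = [ // (sensor_id asc, time desc)
  { sensor_id: 1, time: "12:16", value: 5.15 },
  { sensor_id: 1, time: "12:15", value: 5.12 },
  { sensor_id: 3, time: "12:16", value: 1.23 },
];

function latestFor(sensorId) {
  // The first matching entry in index order is the newest one.
  return indexOrdered.find(d => d.sensor_id === sensorId);
}

console.log(latestFor(1)); // { sensor_id: 1, time: '12:16', value: 5.15 }
```

Running the real query with explain("executionStats") would confirm whether the plan actually walks only those index bounds.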

Environment:

  • MongoDB 4.0.1
  • Java Applications
  • I have plenty of (big) collections that might need this logic (many TB of data in total)
  • New data are added often; roughly one entry per identifier per minute, sometimes multiple entries per second
  • Reads occur at roughly fixed intervals (15 s), each for a fixed subset of ids (only the reader knows which identifiers it uses); only the latest data are relevant
  • Readers might do a cold (re-)start needing to fetch the history to better predict the future/understand the current state (also happens when the reader considers the input data as logically implausible).
  • Network/processing sometimes introduces delays/shuffled input values.

Alternatives considered:

  • Use separate collections: one for live data, one for historical data
    • Decreases write performance as I have to write in two collections
    • Requires some kind of transactional behavior, as I have to make sure that entries will always be added to both collections.
    • Requires additional logic that makes sure that only the latest values are in the live table. Entries might change order or might arrive multiple times due to network/processing latency.
  • Use separate collections with some kind of insert hook
    • Decreases write performance as I have to write in two collections
    • Introduces additional delay due to the hook processing time
    • Requires a new tool and thus increases complexity
    • Requires additional logic that makes sure that only the latest values are in the live table. Entries might change order or might arrive multiple times due to network/processing latency.
    • Sync/hook failures are hard to detect
  • Use of some kind of cache (Redis)
    • Requires a new tool and thus increases complexity
    • Requires some kind of transactional behavior, as I have to make sure that entries will always be added to both tools.
    • Requires additional logic that makes sure that only the latest values are in the live table. Entries might change order or might arrive multiple times due to network/processing latency.
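Should a separate live collection be chosen despite these drawbacks, the "only the latest value wins" requirement can be expressed as a conditional update rather than extra bookkeeping: apply an incoming entry only when its time is newer than what the live document holds. A sketch under assumed names (a live collection keyed by sensor_id):

```javascript
// Mongo shell sketch (names assumed; no upsert, so a brand-new id needs
// a separate insert path):
//   db.live.updateOne(
//     { _id: sensorId, time: { $lt: newTime } },  // match only if newer
//     { $set: { time: newTime, value: newValue } }
//   )
// A late or duplicate entry simply matches no document. The same guard
// shown in memory:
const live = new Map(); // sensor_id -> { time, value }

function applyIfNewer(sensorId, time, value) {
  const cur = live.get(sensorId);
  if (cur === undefined || cur.time < time) {
    live.set(sensorId, { time, value }); // newer: replace
    return true;
  }
  return false; // stale or duplicate: ignore
}

applyIfNewer(1, "12:15", 5.12);
applyIfNewer(1, "12:16", 5.15);
applyIfNewer(1, "12:14", 5.00); // arrives late, ignored
console.log(live.get(1)); // { time: '12:16', value: 5.15 }
```

This makes shuffled or repeated input values harmless on the live side, though the write-amplification and consistency concerns listed above remain.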

EDIT

Example:

sensor_id | time    | value
1         | 12:15   | 5.12 <-- I don't need this value or any of its predecessors
1         | 12:16   | 5.15 <-- I need this value
3         | 12:16   | 1.23 <-- and I need this value
4         | 1998 AD | 4200 <-- Not needed
4         | 1999 AD | 5000 <-- but I need this value (even if it is years old)
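The table above is a "group by sensor_id, keep the newest row" reduction. In MongoDB that can be written as an aggregation of the shape $sort followed by $group with $first (collection name assumed); newer server versions may even be able to answer such a pipeline from a matching index alone. The same reduction applied to the example rows, with the times simplified to sortable numbers:

```javascript
// Mongo shell sketch (collection name assumed):
//   db.readings.aggregate([
//     { $sort: { sensor_id: 1, time: -1 } },
//     { $group: { _id: "$sensor_id",
//                 time:  { $first: "$time" },
//                 value: { $first: "$value" } } }
//   ])
// The same reduction over the example rows:
const rows = [
  { sensor_id: 1, time: 1215, value: 5.12 },
  { sensor_id: 1, time: 1216, value: 5.15 },
  { sensor_id: 3, time: 1216, value: 1.23 },
  { sensor_id: 4, time: 1998, value: 4200 },
  { sensor_id: 4, time: 1999, value: 5000 },
];

function latestPerSensor(rows) {
  const out = new Map(); // sensor_id -> newest row seen so far
  for (const r of rows) {
    const cur = out.get(r.sensor_id);
    if (cur === undefined || cur.time < r.time) out.set(r.sensor_id, r);
  }
  return out;
}

console.log([...latestPerSensor(rows).values()].map(r => r.value)); // [ 5.15, 1.23, 5000 ]
```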

Best Answer

I have an application that (mostly) uses only the very latest data available for a certain identifier (may consist of multiple columns/keys). However, these data may be bad or implausible. In these cases, it has to access the history of the entry to verify that value or calculate an alternative.

For the second use case I created a sorted compound index on identifier..., time. However, in my mind this index does not really fit the main use case, because it still has to consider n values (with different times) for each identifier.

Is there a way in MongoDB to create an index on something like identifier..., $max(time)? Or isn't that necessary at all because MongoDB's internal query optimization is smart enough to only look at the first/last value of the index?

As per the MongoDB blog documentation: By default, MongoDB generates a unique ObjectID identifier that is assigned to the _id field in a new document before writing that document to the database. In many cases the default unique identifiers assigned by MongoDB will meet application requirements. However, in some cases an application may need to create custom unique identifiers, such as:

  • The application may require unique identifiers with a precise number of digits. For example, unique 12-digit identifiers might be required for bank account or credit card numbers.
  • Unique identifiers may need to be generated in a monotonically increasing and continuous sequential order.
  • Unique identifiers may need to be independent of a specific database vendor.

Due to the multi-threaded and distributed nature of modern applications, it is not always a straightforward task to generate unique identifiers that satisfy application requirements.

Use ObjectID as a unique identifier

MongoDB database drivers by default generate an ObjectID identifier that is assigned to the _id field of each document. In many cases the ObjectID may be used as a unique identifier in an application.

ObjectID is a 96-bit number which is composed as follows:

  • a 4-byte value representing the seconds since the Unix epoch (which will not run out of seconds until the year 2106)
  • a 3-byte machine identifier (usually derived from the MAC address)
  • a 2-byte process id, and
  • a 3-byte counter, starting with a random value.
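Because the leading 4 bytes are a big-endian count of seconds since the epoch, the creation time can be recovered from the hex string alone (the example id below is arbitrary, chosen purely for illustration):

```javascript
// Decode the embedded creation time of an ObjectID hex string.
function objectIdToDate(hexId) {
  const seconds = parseInt(hexId.slice(0, 8), 16); // first 4 bytes
  return new Date(seconds * 1000);
}

const when = objectIdToDate("507f1f77bcf86cd799439011");
console.log(when.getUTCFullYear()); // 2012
```

MongoDB drivers expose the same information directly, e.g. ObjectId.getTimestamp() in the mongo shell.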

Benefits

  • ObjectID is automatically generated by the database drivers, and will be assigned to the _id field of each document.
  • ObjectID can be considered globally unique for all practical purposes.
  • ObjectID encodes the timestamp of its creation time, which may be used for queries or to sort by creation time.
  • ObjectID is mostly monotonically increasing.
  • ObjectID is 96-bits, which is smaller than some (eg. 128-bit) alternative UUID implementations, which will result in slightly smaller documents that will use slightly less disk space and RAM than these alternatives.
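Taken together, the last points are why sorting on _id is sometimes used as a cheap stand-in for sorting on insertion time. Since the hex encoding preserves byte order, even a plain string sort illustrates the idea (with the caveat from above: within the same second, ordering is only "mostly" monotonic). The ids below are made up for illustration:

```javascript
// Hex ObjectIDs compare like their underlying bytes, so lexicographic
// order is chronological order down to the embedded second.
const ids = [
  "507f1f77bcf86cd799439011", // timestamp from 2012
  "65a0c0000000000000000001", // timestamp from 2024
  "5b00000000000000000000aa", // timestamp from 2018
];

const newestFirst = [...ids].sort().reverse();
console.log(newestFirst[0]); // 65a0c0000000000000000001
```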

Use a single counter document to generate unique identifiers one at a time

A counter document for unique identifier generation could look as follows:

{
    "_id"   : "UNIQUE COUNT DOCUMENT IDENTIFIER",
    "COUNT" : 0,
    "NOTES" : "Increment COUNT using findAndModify to ensure that the COUNT field will be incremented atomically with the fetch of this document"
}

And the counter document could be atomically requested and incremented as follows. Note that by default the document returned from findAndModify is the pre-modification document:

db.uniqueIdentifierCounter.findAndModify({
    query: { _id: "UNIQUE COUNT DOCUMENT IDENTIFIER" },
    update: {
        $inc: { COUNT: 1 }
    },
    writeConcern: { w: "majority" }
})

Use a single counter document that allocates batches of unique identifiers

This approach is similar to the previous approach, with the difference being that instead of incrementing the COUNT value by 1, we may wish to increment it by a larger number that will represent a batch of unique identifiers that will be allocated by the database to the application.

For example, if the application knows that it needs 1000 new unique identifiers, then the application would use findAndModify() to atomically get the current COUNT and increment the COUNT value by 1000. The document returned from the findAndModify command would contain the starting value for the batch of unique identifiers, and the application would loop over 1000 values from that starting point.

var seq_increment = 1000;
db.uniqueIdentifierCounter.findAndModify({
    query: { _id: "UNIQUE COUNT DOCUMENT IDENTIFIER" },
    update: {
        $inc: { COUNT: seq_increment }
    },
    writeConcern: { w: "majority" }
})
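Since the pre-modification document is returned, the COUNT in that document is the first identifier of the freshly reserved range, and the application hands out start .. start + seq_increment - 1 without further round trips. The consuming side, sketched with an in-memory stand-in for the counter document:

```javascript
// In-memory stand-in for the counter document, to show the hand-out logic.
let counterDoc = { _id: "UNIQUE COUNT DOCUMENT IDENTIFIER", COUNT: 0 };

// Mimics findAndModify's default behaviour: atomically increment,
// return the PRE-modification COUNT.
function allocateBatch(size) {
  const start = counterDoc.COUNT;
  counterDoc.COUNT += size;
  return start; // first identifier of the reserved range
}

const seq_increment = 1000;
const start = allocateBatch(seq_increment);
// start .. start + seq_increment - 1 now belong to this application:
console.log(start, start + seq_increment - 1); // 0 999
```

Identifiers left unused when the application stops are simply skipped, which is usually acceptable; if gap-free sequences are required, this scheme does not fit.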