MongoDB – Use Cases for db.collection.validate()

data validationmongodbmongodb-3.2

I found on MongoDB mailing list one use case for db.collection.validate():

The count() command is generally an estimate of the current document
count, based upon statistics provided by the storage engine.

After an unclean shutdown of a mongod using the WiredTiger storage
engine, count statistics reported by count may be inaccurate. If this
does occur you can run validate() on each collection to restore the
correct statistics.

are there other use cases? Maybe regular DBA job task? We have regularly unclean shutdowns.

The validate command checks the structures within a namespace for
correctness by scanning the collection’s data and indexes. The command
returns information regarding the on-disk representation of the
collection.

The validate command can be slow, particularly on larger data sets.

Best Answer

There is no prescribed need to run db.collection.validate(true) on a regular basis for a healthy MongoDB deployment. Validation with the true or {full:true} parameter can be resource intensive, as this iterates through the collection's data & index structures.

The validate(full) command is typically only used as a diagnostic aid in the event of suspected local data corruption or (for WiredTiger in particular) to true up collection counts after an unclean shutdown. As at MongoDB 3.4, validate() is a read-only command with the exception of validate(true) which will check & adjust collection counts if you are using the WiredTiger storage engine.

Validation generally only surfaces obvious problems in data structures, and cannot detect all possible forms of data corruption. Successful validation can be used as a sanity check if one of your replica set members is encountering data problems that result in an obvious assertion and you want to verify if other secondaries appear to be healthy. If full validation is unsuccessful on a replica set member, the general remedy is to resync the member rather than attempting to repair (which may result in a divergence of data from the other replica set members).

We have regularly unclean shutdowns.

If you have regular unclean shutdowns and are using WiredTiger it would definitely be advisible to run validate to true up collection counts after a restart as these may be inaccurate. Ideally it would be preferable to investigate and resolve the issues or practices leading to your frequent unclean shutdowns.