The general index format used by MongoDB's included storage engines as of 3.0.x (MMAPv1 and WiredTiger) is the B-tree, although there are nuances in each engine's technical implementation.
MongoDB 3.0 introduced a storage engine API which separates the concerns of storage formats (i.e. data & index representations on disk and in memory) from the core server product. For example, WiredTiger supports index prefix compression, data compression, and more granular concurrency than MMAPv1.
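As a concrete illustration, the storage engine is selected at mongod startup (the dbpath here is hypothetical, and snappy, not zlib, is the default compressor):

```shell
# Start mongod with the WiredTiger storage engine (MongoDB 3.0+)
mongod --dbpath /data/db --storageEngine wiredTiger \
       --wiredTigerCollectionBlockCompressor zlib   # choose a block compressor
```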
It is expected that alternative storage engine implementations can (and will) differ in their underlying implementation of indexes and data storage to suit different workloads. For example, WiredTiger has support for LSM (Log Structured Merge-Trees) which is expected to be available in the MongoDB 3.2 production release. There are also alternative storage engines such as RocksDB (which uses LSM) and TokuMXse (which uses Tokutek's fractal tree storage).
I'm not aware of any pluggable storage engines for MongoDB that have been specifically optimized for storage of spatial data, but it is conceivable that one may be created.
You can keep everything on one disk if you wish; you are not obligated to split the components across drives.
Journal
The journal will take 3GB (or less than 400MB if you use the --smallfiles option).
Journal + Pre-Allocation
Be aware that if you don't use --smallfiles, at least 8GB (journal and oplog included) will be pre-allocated on your disk. This is not lost space, just room reserved to improve MongoDB's speed. With --smallfiles, only about 1.4GB will be preallocated.
For exploration and testing purposes, start with --smallfiles.
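For example (dbpath is hypothetical), the option can be set on the command line or in the YAML config file:

```shell
# Command line:
mongod --dbpath /data/db --smallfiles

# Or the equivalent in /etc/mongod.conf (MongoDB 3.x YAML format):
#   storage:
#     dbPath: /data/db
#     mmapv1:
#       smallFiles: true
```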
Logfiles
Logfile size will depend on the verbosity setting and the activity of the system. But since you are talking about an 8GB data disk, it won't be much. By default only some system messages and errors are logged. (http://docs.mongodb.org/manual/reference/configuration-options/)
To rotate the logfiles, send a SIGUSR1 signal to the mongod process ("kill -SIGUSR1 <pid>") or run:
mongo --port 27017 --eval "db.runCommand({logRotate:1});" admin
(http://docs.mongodb.org/manual/reference/command/logRotate/). I then delete the logs older than 3 days via a daily crontab job.
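The cleanup step can be sketched like this (the real directory would be wherever --logpath points, e.g. /var/log/mongodb; a temporary directory stands in for it here so the example is self-contained):

```shell
# Demonstration: prune rotated logs older than 3 days
LOGDIR=$(mktemp -d)                                     # stand-in for the log directory
touch -d "5 days ago" "$LOGDIR/mongod.log.2015-06-01"   # stale rotated log
touch "$LOGDIR/mongod.log"                              # current log
find "$LOGDIR" -name "mongod.log.*" -mtime +3 -delete   # delete logs older than 3 days
```

A crontab entry running the same find command daily (e.g. "0 2 * * * find /var/log/mongodb -name 'mongod.log.*' -mtime +3 -delete") automates the cleanup.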
Splitting on different drives
The reason is to optimize each disk for its purpose. The journal is an append-only write-ahead log that is written sequentially, so it needs fewer IOPS. Logs are similar: information is only ever appended. Data access, on the other hand, jumps around a lot when reading, and sometimes when writing too, filling up gaps that were freed. And while the journal and logs are written away on other disks, the data disk doesn't lose time on them. Every little bit can help on intensive systems. The next step is then replication and sharding to spread the load.
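A sketch of such a layout (the mount points are hypothetical):

```yaml
# /etc/mongod.conf -- data and logs on separate volumes
storage:
  dbPath: /mnt/data/db          # data files on their own disk
systemLog:
  destination: file
  path: /mnt/logs/mongod.log    # logfiles on a second disk
# The journal lives in <dbPath>/journal; to put it on a third disk,
# symlink that directory to the other volume before starting mongod:
#   ln -s /mnt/journal /mnt/data/db/journal
```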
You can get more information about this on https://university.mongodb.com if you are interested. I am currently following M202 (MongoDB Advanced Deployment and Operations), which offers specific guidance on optimization.
There is no prescribed need to run db.collection.validate(true) on a regular basis for a healthy MongoDB deployment. Validation with the true or {full: true} parameter can be resource intensive, as it iterates through the collection's data & index structures.

The validate(full) command is typically only used as a diagnostic aid in the event of suspected local data corruption, or (for WiredTiger in particular) to true up collection counts after an unclean shutdown. As of MongoDB 3.4, validate() is a read-only command, with the exception of validate(true), which will check & adjust collection counts if you are using the WiredTiger storage engine.

Validation generally only surfaces obvious problems in data structures, and cannot detect all possible forms of data corruption. Successful validation can be used as a sanity check if one of your replica set members is encountering data problems that result in an obvious assertion and you want to verify whether other secondaries appear to be healthy. If full validation is unsuccessful on a replica set member, the general remedy is to resync that member rather than attempting to repair it (which may result in a divergence of data from the other replica set members).
If you have regular unclean shutdowns and are using WiredTiger, it would definitely be advisable to run validate to true up collection counts after a restart, as these may be inaccurate. Ideally, though, it would be preferable to investigate and resolve the issues or practices leading to your frequent unclean shutdowns.