Mongodb – Understanding MongoDB’s contra-indications for read preferences

mongodb

I'm very new to MongoDB and am reading the manual to familiarize myself with it. Would appreciate some help in understanding some points related to read preferences.

In general, do not use secondary and secondaryPreferred to provide
extra capacity for reads, because:

  1. All members of a replica set have roughly equivalent write traffic; as
     a result, secondaries will service reads at roughly the same rate as
     the primary.
  2. Distributing read operations to secondaries can compromise
    availability if any members of the set become unavailable because
    the remaining members of the set will need to be able to handle all
    application requests.

I don't agree with point no. 1, and I don't understand point no. 2. If I have three members in my replica set, then it seems like common sense that read traffic can be cut to one-third by distributing it evenly, provided I don't care much about data freshness. As for point no. 2, isn't that the failover situation that MongoDB is fundamentally good at solving? I mean, if the secondary dies, does the client lose all capacity to read? As I see it, a new secondary should automatically be elected to replace it (and if there are none, the load should transfer to the primary).

Please help explain!

Best Answer

1) All members of a replica set have roughly equivalent write traffic; as a result, secondaries will service reads at roughly the same rate as the primary.

By definition, all data-bearing members of a replica set maintain the same data set, so write traffic is roughly similar across secondaries, though not necessarily identical to the primary's. Depending on your workload and server configuration, it is not uncommon to see secondaries with different I/O activity than the primary.

Replica sets are designed to support redundancy and high availability as a key feature, with reads directed to the primary by default (aka a read preference of primary). Servicing secondary reads is a lower design priority than ensuring successful replication and minimizing replication lag.

One interesting caveat is that writes do not have to be applied identically on the primary and the secondaries as long as reads from a secondary always reflect a state that previously existed on the primary. In particular, writes on secondaries are applied in batches using multithreaded replication which improves the write concurrency of secondaries with the expected side effect of adding latency to secondary reads under load.

If you read from secondaries you will have to deal with the possibility of eventual consistency (if there is replication lag) or intermittent latency (if the secondary is actively applying a lot of replicated writes). Replication lag can also vary depending on which secondary you are connecting to, so results may not be as your application code expects if subsequent page requests fetch data from secondaries with significantly different lag.
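If you do opt in to secondary reads, drivers let you bound how much staleness you are willing to accept. A minimal connection-configuration sketch using PyMongo; the host, port, and replica set name `rs0` are assumptions for illustration:

```python
# Hedged sketch (PyMongo): opt in to secondary reads while bounding
# acceptable replication lag via maxStalenessSeconds (minimum 90).
from pymongo import MongoClient

client = MongoClient(
    "mongodb://localhost:27017/"
    "?replicaSet=rs0"                   # assumed replica set name
    "&readPreference=secondaryPreferred"
    "&maxStalenessSeconds=120"          # skip secondaries lagging > ~120s
)
# Reads on this client prefer secondaries but fall back to the primary,
# and the driver avoids secondaries whose estimated lag exceeds the bound.
```

This does not eliminate eventual consistency; it only caps how stale a served read can be.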

I would amend this contra-indication to suggest that "secondaries will likely service reads at a similar or lower rate than the primary".

2) Distributing read operations to secondaries can compromise availability if any members of the set become unavailable because the remaining members of the set will need to be able to handle all application requests.

This contra-indication refers to capacity planning and the possible consequences of failover.

If I have three members in my replica set, then it seems like common sense that read traffic can be cut to one-third by distributing it evenly, provided I don't care much about data freshness.

The specific warnings you highlighted are for read preferences of secondary (only read from a secondary) and secondaryPreferred (read from a secondary if available). Both of these read preferences exclude reading from a primary.

With either of these two read preferences and a healthy three member replica set, your reads could be split between the two secondaries. However, if one secondary is unavailable the surviving secondary will need to service 100% of the secondary read load rather than 50%, which could overwhelm your deployment without careful planning. This is exactly the read preference you've asked for, but perhaps not the outcome you expected.
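The arithmetic above can be sketched as a toy capacity check (pure Python; the request rate is an illustrative assumption):

```python
# Toy capacity check: per-secondary read load under a read preference of
# secondary or secondaryPreferred, which excludes the primary while at
# least one secondary is available.

def per_member_read_load(total_reads_per_sec, healthy_secondaries):
    """Reads each surviving secondary must absorb, assuming an even split."""
    if healthy_secondaries == 0:
        raise ValueError("no secondary available to service reads")
    return total_reads_per_sec / healthy_secondaries

# Three-member set (1 primary + 2 secondaries) handling 10,000 reads/sec:
healthy = per_member_read_load(10_000, healthy_secondaries=2)   # 5000.0 each
degraded = per_member_read_load(10_000, healthy_secondaries=1)  # 10000.0 on the survivor
```

The surviving secondary's load doubles the moment its peer goes down, so each secondary must be provisioned for the full secondary read load, not half of it.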

A better alternative is to use a read preference of nearest, which may read from the primary or from secondaries.
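A hedged PyMongo sketch of setting nearest per database; the replica set name `rs0`, database `test`, and collection `orders` are assumptions for illustration:

```python
# Hedged sketch (PyMongo): nearest spreads reads across whichever members
# (primary or secondary) respond within the lowest-latency window.
from pymongo import MongoClient, ReadPreference

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
coll = client.get_database(
    "test", read_preference=ReadPreference.NEAREST
).get_collection("orders")
# Queries on coll may be served by the primary or any secondary inside the
# acceptable latency window (localThresholdMS, 15 ms by default).
```

With nearest, losing one member redistributes load over all survivors, including the primary, instead of concentrating it on a lone secondary.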

As for point no. 2, isn't that the failover situation that MongoDB is fundamentally good at solving?

Replication handles recovery from failover in a properly configured deployment, but capacity and deployment planning are separate exercises. The defaults are generally chosen to be reasonable, so if you want to change them it is worth understanding the possible implications.

I mean, if the secondary dies, does the client lose all capacity to read? As I see it, a new secondary should automatically be elected to replace it (and if there are none, the load should transfer to the primary).

A three-member replica set has a fault tolerance that allows any one node to be unavailable while still maintaining a primary. If two nodes are unavailable, the surviving node cannot form a majority and will become (or remain) a secondary. With any read preference aside from primary, you could still service reads in this sole-survivor scenario.
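The majority rule behind this fault tolerance can be sketched in a few lines (assuming all members are voting members):

```python
# Sketch of replica set fault tolerance: electing and keeping a primary
# requires a strict majority of voting members.

def majority(voting_members):
    """Votes needed to elect a primary."""
    return voting_members // 2 + 1

def fault_tolerance(voting_members):
    """Members that can fail while the set can still maintain a primary."""
    return voting_members - majority(voting_members)

# A 3-member set needs 2 votes, so it tolerates 1 failure; a 5-member
# set needs 3 votes, so it tolerates 2 failures.
```

This is why losing two of three members leaves only a read-capable secondary: the single survivor cannot reach the two-vote majority needed to become primary.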

For more discussion on secondary reads, see: Can I use more replica nodes to scale?