MongoDB ReplicaSet instance re-join after Kubernetes POD restart

Tags: mongodb, replication

I have a simple MongoDB deployment in Kubernetes with PersistentVolumeClaims, roughly like this:

apiVersion: v1
kind: Service
metadata:
  annotations:
    service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
  name: mongodb
  labels:
    name: mongo
spec:
  clusterIP: None
  selector:
    role: mongo
  ports:
  - port: 27017
    name: mongo
    protocol: TCP
    targetPort: 27017
  - name: metrics
    port: 9216
    protocol: TCP
    targetPort: http
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mongod
...
spec:
  serviceName: mongodb
  podManagementPolicy: OrderedReady
  replicas: 3
  selector:
    matchLabels:
      role: mongo
  template:
    metadata:
      labels:
        role: mongo
...
      containers:
        - name: mongod-container
          image: mongo
          ports:
            - name: mongodb
              containerPort: 27017
          readinessProbe:
            tcpSocket:
              port: mongodb
...
  volumeClaimTemplates:
  - metadata:
      name: mongodb-claim

Once all three replicas are ready, I initialize the replica set with (please note that I don't use FQDNs here):

rs.initiate({_id: "someId", version: 1, members: [
    { _id: 0, host : "mongod-0.mongodb:27017" },
    { _id: 1, host : "mongod-1.mongodb:27017" },
    { _id: 2, host : "mongod-2.mongodb:27017" }
]})

When one of the Pods dies and is re-created, it usually doesn't rejoin the replica set.
It doesn't matter whether the accompanying Service has anything like service.alpha.kubernetes.io/tolerate-unready-endpoints: "true" or publishNotReadyAddresses: true. After the restart it simply stays in the following state:

restarted-mongo> rs.status()
...
MongoDB server version: 4.0.9
{
        "operationTime" : Timestamp(1556193985, 1),
        "ok" : 0,
        "errmsg" : "Our replica set config is invalid or we are not a member of it",
        "code" : 93,
        "codeName" : "InvalidReplicaSetConfig",
        "$clusterTime" : {
...
        }
}

The other Mongo Pods report:

healthy-mongo> rs.status()
...
                       "_id" : 1,
                        "name" : "mongod-1.mongodb:27017",
                        "health" : 0,
                        "state" : 8,
                        "stateStr" : "(not reachable/healthy)",
                        "uptime" : 0,
                        "optime" : {
                                "ts" : Timestamp(0, 0),
                                "t" : NumberLong(-1)
                        },
                        "optimeDurable" : {
                                "ts" : Timestamp(0, 0),
                                "t" : NumberLong(-1)
                        },
                        "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
                        "optimeDurableDate" : ISODate("1970-01-01T00:00:00Z"),
                        "lastHeartbeat" : ISODate("2019-04-25T12:13:05.255Z"),
                        "lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
                        "pingMs" : NumberLong(0),
                        "lastHeartbeatMessage" : "Our replica set configuration is invalid or does not include us",
                        "syncingTo" : "",
                        "syncSourceHost" : "",
                        "syncSourceId" : -1,
                        "infoMessage" : "",
                        "configVersion" : -1
                },
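
To see what the restarted member actually believes its configuration is, it helps to read the config that mongod persists in the local database (in the mongo shell: db.getSiblingDB("local").system.replset.findOne()). Below is a minimal diagnostic sketch in Python with pymongo, assuming kubectl port-forward or exec access to the stuck Pod on localhost:27017:

# Diagnostic sketch: inspect the replica set config the restarted member has
# persisted on disk (mongod stores it in local.system.replset).
from pymongo import MongoClient

# directConnection=True talks to this one mongod only and skips topology
# discovery, which would otherwise fail while the member is stuck.
client = MongoClient("mongodb://localhost:27017", directConnection=True)

cfg = client.local["system.replset"].find_one()
for member in cfg["members"]:
    # These are the names mongod must resolve and match against its own
    # interface addresses (the isSelf() check) on startup.
    print(member["_id"], member["host"])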

I found some information on the MongoDB Jira that mongod initially tries to resolve its own hostname (the isSelf() method) and, if that fails, it gets stuck (and it can fail for any of a thousand reasons; neither probes nor the unready-endpoints toleration will help).

However, when I initialize/reconfigure the replica set like this (note the "FQDN" usage, without a trailing dot):

rs.initiate({_id: "someId", version: 1, members: [
    { _id: 0, host : "mongod-0.mongodb.default.svc.cluster.local:27017" },
    { _id: 1, host : "mongod-1.mongodb.default.svc.cluster.local:27017" },
    { _id: 2, host : "mongod-2.mongodb.default.svc.cluster.local:27017" }
]})

Magically, mongod survives the Pod restart, and I guess (it's hard to be 100% sure) the problem is gone.
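
For an already-initiated replica set, the hostnames can also be migrated to FQDNs with a reconfig instead of re-initiating; in the mongo shell that is cfg = rs.conf(), editing cfg.members[i].host, then rs.reconfig(cfg, {force: true}). Here is a hedged pymongo sketch of the same idea (the .default.svc.cluster.local suffix assumes the default namespace, as above):

# Sketch: rewrite an existing replica set config from short names to FQDNs.
from pymongo import MongoClient

# Connect directly to one member (ideally the current primary).
client = MongoClient("mongodb://mongod-0.mongodb:27017", directConnection=True)

cfg = client.admin.command("replSetGetConfig")["config"]
for member in cfg["members"]:
    member["host"] = member["host"].replace(
        ".mongodb:", ".mongodb.default.svc.cluster.local:"
    )
cfg["version"] += 1

# force=True pushes the new config through even without a healthy primary;
# drop it when reconfiguring a healthy set through its primary.
client.admin.command({"replSetReconfig": cfg, "force": True})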

Could someone explain to me:
1. Can I assume this is the 'fix'?
2. If so, why does it work? What changes when the FQDN is used in this example?

Edit 1:
I think I have some more insight into what is actually going on.

On startup, the restarting mongod instance tries to find itself in the already configured replica set member list:
https://github.com/mongodb/mongo/blob/master/src/mongo/db/repl/repl_set_config_checks.cpp#L59
https://github.com/mongodb/mongo/blob/master/src/mongo/db/repl/isself.cpp#L158

Roughly, for each member in the configured replica set, the restarting mongod instance (see the sketch after this list):
1. gets its own IP addresses
2. resolves the member's hostname
3. compares the resolved IPs with its own
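
The following is a purely illustrative Python rendition of that check; the real logic is the C++ code linked above, and the hostname in the final comment is just an example from this deployment:

# Illustrative rendition of mongod's isSelf() check (not the real implementation).
import socket

def is_self(member_host: str, member_port: int = 27017) -> bool:
    # 1. Gather this machine's own IP addresses.
    own_ips = {info[4][0] for info in socket.getaddrinfo(socket.gethostname(), None)}
    try:
        # 2. Resolve the configured member name.
        resolved = {info[4][0] for info in socket.getaddrinfo(member_host, member_port)}
    except socket.gaierror:
        # If DNS cannot resolve the name (yet), the member cannot recognize itself.
        return False
    # 3. Compare the resolved addresses with our own.
    return bool(own_ips & resolved)

# e.g. is_self("mongod-1.mongodb.default.svc.cluster.local")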

The resolution fails because the Kubernetes DNS record for the restarted Pod is not refreshed yet.
MongoDB does not retry this self-lookup, so the restarting instance ends up stuck in the "Our replica set configuration is invalid or does not include us" state.
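
Since mongod does not retry that lookup, one possible workaround (just a sketch of my own, not something from the ticket below) is to delay mongod's start, for example from an initContainer or an entrypoint wrapper, until the Pod's own DNS record resolves:

# Workaround sketch: block until this Pod's own DNS record exists, so mongod's
# one-shot isSelf() lookup cannot race the Kubernetes DNS update.
import os
import socket
import time

# Assumed naming from the StatefulSet above:
# <pod-name>.<service-name>.<namespace>.svc.cluster.local
fqdn = os.environ["HOSTNAME"] + ".mongodb.default.svc.cluster.local"

for _ in range(60):                  # arbitrary limit: wait up to ~60 seconds
    try:
        socket.getaddrinfo(fqdn, 27017)
        break                        # DNS is ready; safe to start mongod now
    except socket.gaierror:
        time.sleep(1)
else:
    raise SystemExit("DNS record for %s never appeared" % fqdn)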

I guess the final solution for this is tracked in this issue:
https://jira.mongodb.org/browse/SERVER-40159

Surprisingly, however, this issue hasn't gained much traction.
Moreover, I investigated the 'FQDN' usage further: when using a "proper" FQDN with a trailing dot, the restarting replica does not rejoin the replica set.
With the full search domain but without the trailing dot, it works (or at least seems to).

To sum up:
– Could someone explain why the FQDN without a trailing dot seems to work (or am I just lucky that it has always worked for me)?
– Could someone confirm that the MongoDB isSelf() issue is the root cause here, and its fix the solution?

Best Answer

Just to note: the issue is still there with MongoDB 4.2.6 and, as said in this post, it is only solved by providing an FQDN when configuring the replica set.