Replica set health

Daniel Cruz

7 years ago

Replication lag

Typically, in our replica sets, we monitoring the replication lag, that is a measure of who far a secondary is behind the primary:

Exist two types of replication:

Chained replication, configured by default. A chained replication occurs when a secondary member replicates from another secondary member instead of from the primary. Recommended for reducing the load of the primary.

Primary replication needs to explicit configure. A primary replication occurs when all secondary members replicate from the primary.

In mongo shell we get the information with rs.printSlaveReplicationInfo()

Output example

source: m1.example.net:27017

syncedTo: Thu Apr 10 2014 10:27:47 GMT-0400 (EDT)

0 secs (0 hrs) behind the primary

source: m2.example.net:27017

syncedTo: Thu Apr 10 2014 10:27:47 GMT-0400 (EDT)

0 secs (0 hrs) behind the primary

Output example

source: m1.example.net:27017

syncedTo: Thu Apr 10 2014 10:27:47 GMT-0400 (EDT)

0 secs (0 hrs) behind the primary

source: m2.example.net:27017

syncedTo: Thu Apr 10 2014 10:27:47 GMT-0400 (EDT)

0 secs (0 hrs) behind the primary

Oplog window

The other important aspect that we need to be in mind, is the Oplog window, is the time difference between the oldest and newest entries in the oplog.

The oplog window tells us the amount of time to work with secondary members, for example:

Add new secondary member
Install new OS patch
Install new MongoDB version
Full resync of the secondary member
Recover a secondary member from a failure

For show information of oplog window in mongo shell rs.printReplicationInfo()

Output example

configured oplog size:   192MB

log length start to end: 65422secs (18.17hrs)

oplog first event time:  Mon Jun 23 2014 17:47:18 GMT-0400 (EDT)

oplog last event time:   Tue Jun 24 2014 11:57:40 GMT-0400 (EDT)

now:                     Thu Jun 26 2014 14:24:39 GMT-0400 (EDT)

Output example

configured oplog size: 192MB

log length start to end: 65422secs (18.17hrs)

oplog first event time: Mon Jun 23 2014 17:47:18 GMT-0400 (EDT)

oplog last event time: Tue Jun 24 2014 11:57:40 GMT-0400 (EDT)

now: Thu Jun 26 2014 14:24:39 GMT-0400 (EDT)

Conclusion

It´s very important to monitor the health of our replica sets, the aspects to monitor are:

Oplog lag

OplogWindow

We can use monitoring tools like Nagios, Mongo Compass, Atlas, to automating the monitoring and alert if something goes wrong.