MongoDB Resync stale replica member

Selcen Sahin

8 years ago

Suppose a member of a replica set is stuck in the RECOVERING state and its log file contains entries like these:

2018-02-14T16:02:50.410+0300 E REPL     [rsBackgroundSync] too stale to catch up -- entering maintenance mode

2018-02-14T16:02:50.410+0300 I REPL     [rsBackgroundSync] our last optime : (term: 32, timestamp: Jan 12 22:42:48:1c51)

2018-02-14T16:02:50.410+0300 I REPL     [rsBackgroundSync] oldest available is (term: 38, timestamp: Jan 29 13:53:34:6f)

This means that the oplog was not large enough to cover the time the replica member was unreachable. A network problem, a lack of disk space, or a host that was shut down can cause this kind of problem.

The oplog must be large enough to cover the time it takes to fix the problem. If the problem is not solved within that window, the member falls too far behind to catch up and stays in the RECOVERING state.
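To make the "too stale" condition concrete, here is a minimal sketch in plain JavaScript. The helper names `isTooStale` and `oplogCoversOutage` are made up for illustration, and the year in the sample dates is assumed from the log date. The second helper assumes an object shaped like the result of db.getReplicationInfo(), where timeDiff is the oplog window in seconds.

```javascript
// Hypothetical helper: a member is too stale when its last applied
// operation is older than the oldest entry still kept in the oplog
// of the member it syncs from.
function isTooStale(lastApplied, oldestAvailable) {
  return lastApplied.getTime() < oldestAvailable.getTime();
}

// Hypothetical helper: does the oplog window cover an outage of the
// given length? `replInfo` is assumed to look like the document
// returned by db.getReplicationInfo() (timeDiff is in seconds).
function oplogCoversOutage(replInfo, outageSeconds) {
  return replInfo.timeDiff >= outageSeconds;
}

// Timestamps taken from the log lines above (year assumed to be 2018).
const lastApplied = new Date("2018-01-12T22:42:48+03:00");
const oldestAvailable = new Date("2018-01-29T13:53:34+03:00");
console.log(isTooStale(lastApplied, oldestAvailable)); // prints true
```

In a live shell the second helper could be called as oplogCoversOutage(db.getReplicationInfo(), 6 * 3600) to ask whether the current oplog would survive a six hour outage.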

The only thing you can do at this point is resync the stale member from a healthy member of the replica set.

There are two ways:

1- Initial sync (with no initial data)

2- Sync by copying data files from another member

The steps for the first method are:

1- Shut down the server:

use admin

db.shutdownServer()

2- Remove the old data directory.

3- Create an empty data directory with the correct ownership and permissions.

4- Start the mongod process.

The state of the member changes from RECOVERING to STARTUP2 while the initial sync runs.
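The state change can be followed from the shell. The sketch below is only an illustration: `memberState` is a hypothetical helper, the host names are made up, and the sample document only mimics the `members` array (with `name` and `stateStr` fields) that rs.status() returns.

```javascript
// Hypothetical helper: look up one member's state in a document
// shaped like the output of rs.status().
function memberState(status, hostName) {
  const member = status.members.find((m) => m.name === hostName);
  return member ? member.stateStr : null;
}

// Sample document standing in for rs.status(); host names are made up.
const sampleStatus = {
  set: "rs0",
  members: [
    { name: "primaryserver:27017", stateStr: "PRIMARY" },
    { name: "staleserver:27017", stateStr: "STARTUP2" },
  ],
};

console.log(memberState(sampleStatus, "staleserver:27017")); // prints STARTUP2
```

In a real shell session the call would be memberState(rs.status(), "host:port") with the actual address of the resyncing member.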

Second method:

During the initial sync, the primary may stop answering the stale secondary, which restarts the synchronization process.

If the primary is busy answering other requests at the same time and the sync takes a long time, these interruptions can happen repeatedly and the synchronization never finishes.

In the log of the secondary:

2018-08-28T05:17:05.817+0300 I INDEX    [rsSync] build index done.  scanned 138338099 total records. 5629 secs

2018-08-28T05:17:05.827+0300 I STORAGE  [rsSync] copying indexes for: { name: "Announcement", options: {} }

2018-08-28T05:17:05.827+0300 I NETWORK  [rsSync] Socket say send() errno:32 Broken pipe xxxxxxxxxxxxxxxxxxxx

2018-08-28T05:17:05.889+0300 E REPL     [rsSync] 9001 socket exception [SEND_ERROR] server xxxxxxxxxxxxxxxxxxxxxxx

2018-08-28T05:17:05.889+0300 E REPL     [rsSync] initial sync attempt failed, 9 attempts remaining

2018-08-28T05:17:10.889+0300 I REPL     [rsSync] initial sync pending

2018-08-28T05:17:12.709+0300 I REPL     [ReplicationExecutor] syncing from: primaryserver

2018-08-28T05:17:12.747+0300 I REPL     [rsSync] initial sync drop all databases

2018-08-28T05:17:12.747+0300 I STORAGE  [rsSync] dropAllDatabasesExceptLocal 4

2018-08-28T05:17:16.312+0300 I REPL     [rsSync] initial sync clone all databases

2018-08-28T05:17:16.364+0300 I REPL     [rsSync] fetching and creating collections for admin

2018-08-28T05:17:16.374+0300 I REPL     [rsSync] fetching and creating collections for db1

2018-08-28T05:17:16.519+0300 I REPL     [rsSync] fetching and creating collections for db2

2018-08-28T05:17:16.532+0300 I REPL     [rsSync] initial sync cloning db: admin
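A restart loop like the one in the log above can be spotted by looking for the "initial sync attempt failed" message. The following is a rough sketch in plain JavaScript; `failedSyncAttempts` is a hypothetical helper, and the message format is assumed to match the log lines shown here, which may differ in other MongoDB versions.

```javascript
// Hypothetical helper: collect the "attempts remaining" counter from
// every failed initial sync attempt found in the given log lines.
function failedSyncAttempts(logLines) {
  const pattern = /initial sync attempt failed, (\d+) attempts remaining/;
  const remaining = [];
  for (const line of logLines) {
    const match = pattern.exec(line);
    if (match) remaining.push(Number(match[1]));
  }
  return remaining;
}

// Two lines copied from the log above.
const lines = [
  "2018-08-28T05:17:05.889+0300 E REPL     [rsSync] initial sync attempt failed, 9 attempts remaining",
  "2018-08-28T05:17:10.889+0300 I REPL     [rsSync] initial sync pending",
];

console.log(failedSyncAttempts(lines)); // prints [ 9 ]
```

A counter that keeps dropping across many hours of log output is a sign that the initial sync is unlikely to finish on its own.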

In this situation the second method is more effective.

You need to copy the data files from a healthy member. Before copying, lock the healthy secondary so that its data files do not change during the copy.

1- On the healthy secondary, run:

db.fsyncLock()

2- Shut down the unhealthy mongod instance:

use admin

db.shutdownServer()

3- Copy the data files under /data from the healthy secondary to the unhealthy server.

4- Start the mongod instance on the unhealthy server.

5- Unlock the healthy secondary:

db.fsyncUnlock()

After the data files are copied, the replica member catches up with the primary in a short time.
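One risk in the second method is forgetting to unlock the healthy secondary if the copy fails halfway. The sketch below shows one way to pair the two calls; `withFsyncLock` and `fakeDb` are hypothetical names, and the copy step is only a placeholder because the actual file copy happens outside the shell.

```javascript
// Hypothetical helper: run `copyStep` while the database is locked and
// always release the lock afterwards, even if copyStep throws.
function withFsyncLock(db, copyStep) {
  db.fsyncLock();
  try {
    return copyStep();
  } finally {
    db.fsyncUnlock();
  }
}

// Stand-in for the shell's `db` object so the sketch runs without a server.
const calls = [];
const fakeDb = {
  fsyncLock: () => calls.push("fsyncLock"),
  fsyncUnlock: () => calls.push("fsyncUnlock"),
};

withFsyncLock(fakeDb, () => calls.push("copy data files"));
console.log(calls.join(" -> "));
```

With the real `db` object on the healthy secondary, the placeholder would be replaced by whatever waits for the copy to finish. The try/finally pairing is a design choice of this sketch, not something the steps above require.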