An AlwaysOn AG Replica Suddenly Failed

11 Jan by Ekrem Önsoy

Last night, a customer started receiving error messages about one of the replicas in an AlwaysOn Availability Group configuration. The messages reported a synchronization delay and the failure of that replica. I connected to the server and reviewed the AlwaysOn AG Dashboard and the related logs. The errors came from a replica in the NOT SYNCHRONIZING state.

I found the following record in the AlwaysOn_health Extended Events file:

A connection timeout has occurred on a previously established connection to availability replica '' with id [XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX]. Either a networking or a firewall issue exists or the availability replica has transitioned to the resolving role.

Next I tried to connect to the server hosting the SQL Server instance. RDP did not work (I will explain why below), but I could connect with SSMS and reach the service port with telnet, so the SQL Server service was running. I connected to the instance and examined the Error Log, where the same message appeared at the moment the problem first occurred.
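The telnet check can also be scripted. What follows is a minimal Python sketch, not the method used in this incident: it only tests whether a TCP port accepts connections. The function name `tcp_port_open` is hypothetical, and which port to test is an assumption (1433 is the SQL Server default, but the instance may be configured differently).

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established.

    This mirrors what a manual telnet test shows: the port is reachable
    and something is listening. It says nothing about authentication.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For example, `tcp_port_open("sqlnode2.example.local", 1433)` would check the default SQL Server port on a hypothetical host name. A `True` result only shows that the service is listening; as this incident shows, authentication can still fail.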

After this error, there were errors similar to the following:

AlwaysOn Availability Groups connection with primary database terminated for secondary database ‘DATABASE_NAME’ on the availability replica with Replica ID: {XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX}. This is an informational message only. No user action is required.

But in my scenario, the crucial part was the following error message:

Database Mirroring login attempt failed with error: ‘Connection handshake failed. An OS call failed: (80090311) 0x80090311(No authority could be contacted for authentication.). State 67.’.  [CLIENT: XX.X.X.XX]
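To make the important parts of this message easier to spot when scanning a long Error Log, the line can be picked apart with a small script. This is a rough Python sketch under stated assumptions: the regular expression is modelled only on the single message quoted above, and the names `parse_handshake_error`, `OS_CALL_RE` and `SSPI_CODES` are hypothetical. The lookup table contains only the one Windows status code that appears in this article.

```python
import re
from typing import Optional

# Windows SSPI status codes; only the code seen in this article is listed.
SSPI_CODES = {
    0x80090311: "SEC_E_NO_AUTHENTICATING_AUTHORITY",
}

# Pattern modelled on the message quoted above (an assumption, not a
# general parser for every SQL Server handshake error).
OS_CALL_RE = re.compile(
    r"An OS call failed: \((?P<code>[0-9A-Fa-f]+)\)\s*"
    r"0x[0-9A-Fa-f]+\((?P<message>.*?)\)\.\s*State (?P<state>\d+)"
)


def parse_handshake_error(line: str) -> Optional[dict]:
    """Extract the OS error code, its text and the state from a log line."""
    match = OS_CALL_RE.search(line)
    if match is None:
        return None
    code = int(match.group("code"), 16)
    return {
        "code": f"0x{code:08X}",
        "name": SSPI_CODES.get(code, "unknown"),
        "message": match.group("message"),
        "state": int(match.group("state")),
    }
```

The code 0x80090311 corresponds to SEC_E_NO_AUTHENTICATING_AUTHORITY: Windows could not reach a domain authority to validate the account. That points to Active Directory or domain membership rather than to SQL Server itself.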

When I saw this error message, I remembered an incident that had happened earlier in this environment.

Fifteen days earlier, the computer object of the server hosting this SQL Server instance had been accidentally deleted from Active Directory. The deletion caused no visible problem at the time, but when I heard about it I told my colleagues, “It may not be causing a problem right now, but one day it will surely blow up.”

To resolve the problem, my colleagues logged on to the server hosting the SQL Server instance with a local Windows account, removed the server from the domain, and then joined it to the domain again.

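After this, we observed that both the AlwaysOn problem and the RDP problem were resolved, which suggests they shared the same root cause: the server could no longer authenticate against the domain.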

Yes, there may have been a neater way to restore the computer object in Active Directory. But investigating that would have taken time, and the person best placed to do it was the colleague who had deleted the object, because he knew exactly what he had done at the time. It was outside working hours, so a more “practical” method was used, and it worked.

The original article was written in Turkish by Ekrem Önsoy and translated to English by dbtut with the consent of the author. The copyright of the article belongs to the author. The author shall not be liable in any way for any defect caused by translation.
