One of the most common error you’ll encounter when you try to join a SQL Server instance as a replica in an Always On Availability Group is the dreaded Error 35250. You’ll find a ton of blog posts and articles on how to resolve this particular error.
The goal of this blog post is not to provide a resolution just like every other blog posts and articles on the internet. Rather, I’d like to help you understand why this specific error occurs using a very common model that we are very familiar with yet barely pay attention to. By understanding the model, you can prevent the issue from happening instead of trying to resolve it when it occurs. It’s also a good model to use in case you already encountered the issue and you haven’t read this blog post yet.
The Client-Server Model
The client-server model is a distributed application structure that partitions tasks or workloads between the providers of a resource or service, called servers, and service requesters, called clients. I’m pretty sure you are familiar with this since most of the tasks we do on our smartphones and computers use this model. But let’s make this even more relevant to SQL Server administrators. Because this is the very model that SQL Server uses.
SQL Server acts as a server – a database server, to be more specific. Client applications like SQL Server Management Studio, your web application, your mobile application, your middle-tier, etc. connect to SQL Server to “request something” like data from or write data to a table. Since the server is the one providing a service, the client application is the one initiating the request.
In order for SQL Server (or the server) to successfully service requests from client applications, there are four (4) things that the client applications need. These 4 things provide insights into figuring out why you encounter the dreaded Error 35250.
- Fully Qualified Domain Name or IP address. Since the service is located somewhere in the network, the client application has to be able find it. This can be done via either a fully qualified domain name (FQDN) or an IP address. I prefer FQDNs since I can’t memorize IP addresses (who knows what public IP address Google or Microsoft will be using tomorrow). That’s why I’m a big fan of standard naming conventions. Let’s say you want to connect to a default SQL Server instance on the server named TDPRD021, the client application needs to be able to find where TDPRD021 is on the network. A simple way to check whether the client application can find the server on the network is by doing a PING test.
- Port number. A server can be running multiple services simultaneously. Your SQL Server machine is also running a Windows Update service, Network Location Awareness (NLA) service and several others. But we’re not concerned about those. What we’re interested in is the SQL Server service, the one responsible for serving database requests. Since the client application can find the server in the network via the FQDN or IP address, it needs to be able to connect to the specific service that it needs – in the case of SQL Server, the database engine service. To distinguish the database engine service from all the other services running on the server, a port number is used. You can think of the FQDN or IP address as the main telephone trunk line while the port number as the line extension of the specific person. If a client application wants to call SQL Server on the server named TDPRD021, it has to dial the IP address and request to connect to the port number.
SQL Server’s default port number is 1433. For a client application to successfully connect to the SQL Server database engine service, it has to connect to the FQDN or IP address and the corresponding port number, 1433.
- Authentication. Connectivity to the server isn’t enough for a client application to make a request. For critical services that need to be protected, the client application needs to prove that it is who it says it is. This is where authentication comes in. Authentication is the process of proving that a certain claim is true and valid. So, when a client application connects to a SQL Server database, it has to provide an identity that exists in SQL Server – either a Windows login or a SQL Server login. If this identity does not exist, the client application won’t be allowed to connect. This is where you’ll get error 18456 in SQL Server. Of course, the error message returned to the client application is obfuscated for security reasons.
- Authorization. Similarly, connectivity and authentication are not enough for a client application to make a request. Let’s say the client application managed to find the SQL Server instance running on the machine named TDPRD021, connected to the correct port number and got authenticated. If an INSERT statement is executed and the identity that the client application used only has db_datareader permissions, the request won’t be completed. The identity needs to be granted the db_datawriter permissions to complete the request successfully. In other words, the identity needs to be authorized to do what it needs to do.
The Client-Server Model Applied to Always On Availability Group Replicas
Now that you have an understanding of the client-server model, let’s see how this applies to properly configuring an Always On Availability Group replica and avoid the dreaded error 35250. I like to think of an Always On Availability Group topology in light of the client-server model, where the primary replica acts as the server and the secondary replicas act as clients. In the diagram below, SERVER1 will be configured as the primary replica and SERVER2 as the secondary replica.
In order to successfully join a SQL Server instance as a replica in an Always On Availability Group and avoid the dreaded error 35250, you need to test against those 4 things:
- Can the SQL Server instance resolve the FQDN or IP address? If you need to add SERVER2 to the Always On Availability Group, can it find SERVER1 on the network? A simple PING test to either the FQDN or the IP address will do the trick. Just make sure that it is the same one that you used in the ENDPOINT_URL parameter of the CREATE AVAILABILITY GROUP command.
- Can the SQL Server instance connect to the configured port number? Always On Availability Group uses port 5022 as the default port number in order for the secondary replica to connect to the primary replica. In some organizations where ICMP responses are disabled, the PING test might return a request timed out message. So, a TELNET test on the correct port number would be a more appropriate test. Run the TELNET test from the secondary replica – SERVER2 – to test connectivity to the primary replica – SERVER1. Just be sure it is the same one that you used in the LISTENER_PORT parameter of the CREATE ENDPOINT command.
This becomes a bit more challenging when you have multiple SQL Server instances in the same machine. You can only have one port 5022 per machine. Whichever SQL Server instance you configured to use port 5022 first will own that port number. Your only option is to either (1) use a different port number for the other SQL Server instances or (2) assign another IP address on the machine that you can associate with the other SQL Server instance so you can still use port 5022. One important thing that you need to remember is that a service can be uniquely identified on the network using a combination of FQDN/IP address AND a port number. Not just the FQDN/IP address or the port number but the combination of these two. So, if you want to use port 5022 for all of the SQL Server instances on a single machine, just assign more IP addresses. You just can’t have both the primary and secondary replicas of an Always On Availability Group in the same machine (this was possible in the good-old-days of database mirroring) as it defeats the purpose of providing high availability and shared nothing architecture.
There are several reasons why your TELNET test might fail and prevent the SQL Server instance from joining the Always On Availability Group. Maybe a firewall is blocking the configured port. Or the FQDN/IP address could not be resolved. Or maybe SQL Server isn’t really listening on the port number that you configured. Maybe the endpoint has not been started. You can verify this by either reviewing the SQL Server error log or using the NETSTAT command.
One that I rarely see in the field is the use of a host intrusion prevention system (HIPS) like those of McAfee and Symantec. I’ve been bitten by those in the past but there aren’t a lot of resources that talk about those cases.
Be sure to resolve connectivity-related issues first before checking the next item. These first two items could be causing issues yet they can be outside of SQL Server. You may need to get your network engineers and system administrators involved.
- Can the SQL Server instance authenticate to the configured primary replica? Since the secondary replica is acting as the client application, not only does it need to connect to the primary replica via the FQDN/IP address and port number combination but also needs to be authenticated. The identity that the secondary replica uses to connect to the primary replica is the SQL Server service account.
If you are using an Active Directory domain account for your SQL Server service account (highly recommended for operational efficiency), you need to add that as a SQL Server login on the primary replica. If you are using built-in accounts like Local System, Local Service or Network Service, you either (1) add the computer account as a SQL Server login on the primary replica if your failover cluster nodes are joined to an Active Directory domain or (2) use certificates. Using certificates for endpoint authentication is nothing new. They have been around since SQL Server 2005 when you need to configure database mirroring for workgroup machines. Today, you can even use them for SQL Server 2017 Always On Availability Groups running on Linux. I just won’t recommend them unless they are your only option due to the high possibility of configuration drift (or what I call the administrator’s amnesia) especially if you don’t have a proper change management process. Also, make sure that the encryption algorithm you used to create the endpoints are consistent across all replicas.
- Is the SQL Server instance authorized to connect to the configured primary replica? So, the secondary replica can find the primary replica on the network, can connect to the appropriate port number and is authenticated. Then comes the last item in the list: is it authorized to connect to the endpoint? This is where you grant the CONNECT permissions on the endpoint to the SQL Server service account.
Now, at any time, the Always On Availability Group can failover from the current primary to any one of the secondary replicas. When this happens, the roles switch, making the old primary replica (SERVER1) the new client and the old secondary replica (SERVER2) the new server.
You need to ask the same 4 questions to verify that the client-server model still applies. Otherwise, you’ll get a different error. At least, it’s no longer be the dreaded error 35250.
- Blog: SQL AlwaysOn: Failed to join the database to the Availability Group (Error 35250)
- Blog: Create Availability Group Fails With Error 35250 ‘Failed to join the database’
- Blog: The connection to the primary replica is not active. The command cannot be processed
- Blog: Troubleshooting AlwaysOn – Sometimes it takes many sets of eyes
- Create an Availability Group (Transact-SQL)