The reason that this fails outside of the internal Rackspace location is due to the URL Endpoints being set to a value that is not able to be connected through from your local environment.
I discuss this process at a high level in this blog post, however to quickly recap the point that needs to be made here:
The endpoint url is the address where the connection will be routed in order to connect and run their queries. [roughly speaking]
The URL is active directory specific FQDN, so I cannot use it as is
from ourside the the racksapce domain. The ROR URL is some thing like
"tcp://1234-db1.abc.intensive.int:1433"
This means that the client driver is going to be re-pointed to this address; if the address is not reachable then the driver won't be able to connect and you'll have a problem - which you happened to run into.
In your example, the client driver will get back
"tcp://1234-db1.abc.intensive.int:1433" and attempt to connect to it like it would any other instance of SQL Server. Since it can't (as
you've stated) you won't be able to be routed and your connection
should then be on the primary. What I did was I changed the ROR URL
for each node to public IP address instead of the FQDN/ Host-name, so
when the request is sent to AG listener from outside it knows where to
hand-off in the public facing IP. It looks like, if we have to connect
from outside I have to use public IP (Eg: 72.32.XX.XX) as opposed to
tcp://1234-db1.abc.intensive.int:1433 in the ROR.
While the IP may or may not need to be public (different architectures at different companies' may or may not require different things), it seems in your case it does... though I'm not a networking architecture guru so I can't comment on how/why this was implemented or any technical challenges observed.
Also when we deploy the application in the same domain as the db
servers Active Directory domain "abc.intensive.int" would that pose a
problem when apps try to connect to the db AG listener?
The domain shouldn't make a difference here, the real difference is whether or not your DNS is setup where it can do the lookups required to get the information (IIRC this would be a reverse lookup zone but don't quote me on that). If you can resolve the DNS name, it will work fine, if you can't it obviously won't... this is more of a question for your DNS/AD admins as they hold all the information to that kingdom.
However, one set has a listener created and the other doesn't.
Well that doesn't help you application stay available! You can absolutely have an availability group without a listener, but the useful of such setups are low.
(From Comments) The listener is the correct approach, but it's real function is to be able to direct intent read only connections to the proper secondary.
That's not quite correct. The listener always points to the primary replica and the main purpose of it is to facilitate transparent failover so that the connection strings do not need to change. One additional use is in read only routing as you've said but the primary use is actually for transparent failover.
(From Comments) ... but the Cluster Name provides the same seamless redirection to the primary node as the listener does.
The CNO has no idea about what an availability group is, which should be primary, etc. The CNO absolutely DOES NOT always point to the primary replica. You can connect using it only when the core cluster resources (which holds the CNO) happens to be owned by the same node as the AG primary but if it isn't - and the CNO doesn't follow the AG around - then it'll fail. Here's an example:
The one that doesn't have a listener is working fine. The one with the listener has issues at failover.
The one without the listener might be working fine... does it still work fine on a failover? Are they using some other DNS alias to "act" like the listener may.
The one with the listener, what does their connection string look like? If it isn't using the listener name in the connection string then of course it isn't going to work... they need to point it to the right place, that's why the listener exists!
The connection to the listener may also fail if there are multiple subnets, older client libraries are used, or certain keywords aren't in the connection string. If there are multiple subnets, the client driver should be something that supports the MultiSubnetFailover keyword and this should be set to TRUE.
It doesn't use the listener to connect, instead it uses the cluster name.
Well, that's 50% of the problem.
The GIS application accessing this server cannot connect to it after a failover.
If they update their connection string to use the listener and set MultiSubnetFailover = True then I'm betting it'll work... assuming the client library used to connect supports it.
I tried to have them use the listener but they said they couldn't even connect to that. I assume the listener hasn't been set up correctly.
If SQL Server created the listener for you, I highly doubt it is setup incorrectly... I won't rule out edge cases though.
Can I just remove the listener since it is not used anyways?
I would do the opposite. Have the GIS team change their connection string to the listener and set multisubnetfailover. It should, then, work... again with previously said assumptions.
But everything I read says that you need to use a listener, yet the servers without a listener are working fine. I'm so confused.
Yes, you do want to use the listener. The fact that it is working is either because there is some document that says "always have this on node3" or you've been getting lucky that it stays working. They might have used the replica name directly in the connection string also... that would allow it to connect. Depending on the settings for the secondary role it may or may not still connect and work properly... just depends on said settings.
Best Answer
Yes. Not to put a "The answer is in the blog" answer, but I did write it and it apparently survived the Blog migration fiasco. So, the answer is in this blog...
The requisite excerpt is:
Please Note: When creating the listener, you'll need to use the IP Address of the primary node. Since this does not use WSFC, any failovers will need to have the listener dropped and re-created with the current primary node's IP address to properly read only route.
Again, the requisite information from my post:
Then we get to the listener, which I will be using the local node's IP address and default port. The value of this may not even matter if it won't be registered in DNS based on what the use cases are.
Since SQL Server doesn't create DNS records (that's the job of the network name resource in WSFC) you'll need to manually create one and manually update it should you have a failover. The Listener only facilitates the need for read only routing. If you're not using read only routing then there isn't any point to creating a listener.
Then if you continue to want to use a read-scale AG and not a WSFC based one (or Pacemaker/etc.) then you'll want to have a load balancer in front of the read-scale AG that can do scripted health checks to know which IP is the primary and is healthy. Otherwise, don't use read-scale AGs.