Comments (12)

derekcollison commented on June 8, 2024

Our aim is next week.

from nats-server.

kozlovic commented on June 8, 2024

@JohnTseng1012 Glad to know that the issue is resolved once you set the advertise and did a rolling restart. I am closing this issue now.

kozlovic commented on June 8, 2024

@JohnTseng1012 How do you specify the listen specification in the gateway{} block?

kozlovic commented on June 8, 2024

If you don't want to use advertise, you should set the listen config to the public address: listen: "10.xxx...."

JohnTseng1012 commented on June 8, 2024

Thank you for your suggestions.
Additionally, will the "no_advertise" feature be provided in the gateway in the future? Another question is why it gets stuck after attempting to reconnect several times, requiring a one-hour wait before attempting to reconnect again.

kozlovic commented on June 8, 2024

The "no advertise" does not make sense in this context. This is normally used to avoid advertising URLs to client connections. Gateways never advertise server URLs from other clusters to clients.

Have you verified that using a proper "listen" specification solves your issue? You should not have non local IPs anyway. The server will detect interfaces if the specification is "any" (0.0.0.0) and should exclude local IPs. We may need to run a test on those machines to see what is being returned by:

func (s *Server) getNonLocalIPsIfHostIsIPAny(host string, all bool) (bool, []string, error)

If you specify a hostname (which it does not look like you do) and it were to resolve to an internal IP, that could also explain it.

As for the reason it blocked, I am not sure at all. Perhaps the pending PR (#5356) will help?

kozlovic commented on June 8, 2024

@JohnTseng1012 I tried even with the older server v2.5.0, and it seems to work fine. Again, my guess is that you are not specifying the "listen" option, so the server finds the interfaces and picks the first one, which may be the 172.xx that you are referring to as internal. If you run the server with the -D debug flag, you will see output such as:

[3180] 2024/04/25 12:28:49.617175 [DBG] Get non local IPs for "0.0.0.0"
[3180] 2024/04/25 12:28:49.617418 [DBG]   ip=<some IP>
[3180] 2024/04/25 12:28:49.617422 [DBG]   ip=<some IP>
..
[3180] 2024/04/25 12:28:49.617532 [INF] Server is ready
[3180] 2024/04/25 12:28:49.617577 [INF] Cluster name is WEST

If the first on the list is a 172.x address, then yes, it will be used as the listen specification when sending to others. So the simple solution is to use the public address in the "listen" specification.

You can check your logs and see what address is being used. You should see something like:

 Address for gateway "<gateway name>" is <IP>

Again, if this IP is 172.x, then that means that it was the first in the list of returned interfaces.
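
To illustrate why a 172.x address can be selected, here is a minimal sketch of "take the first non local IP" logic. It is not the nats-server implementation: firstNonLocalIP is a hypothetical helper, the sample addresses are made up, and it assumes "local" means loopback or link-local, which may differ from what getNonLocalIPsIfHostIsIPAny actually filters.

```go
package main

import (
	"fmt"
	"net"
)

// firstNonLocalIP returns the first address in addrs that is neither
// loopback nor link-local. Hypothetical helper for illustration only;
// the real server may apply different filtering rules.
func firstNonLocalIP(addrs []string) (string, bool) {
	for _, a := range addrs {
		ip := net.ParseIP(a)
		if ip == nil || ip.IsLoopback() || ip.IsLinkLocalUnicast() {
			continue // skip unparsable and "local" addresses
		}
		return ip.String(), true
	}
	return "", false
}

func main() {
	// Made-up interface addresses, in the order the OS might report them.
	addrs := []string{"127.0.0.1", "172.17.0.2", "10.1.2.3"}
	if ip, ok := firstNonLocalIP(addrs); ok {
		fmt.Println("would be sent to peers:", ip)
	}
}
```

Under these assumptions a private 172.x address passes the filter, because private is not the same as local, and it wins simply by being listed first.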

JohnTseng1012 commented on June 8, 2024

@kozlovic I have set the listen, but the logs show the following message:

[FTL] Error listening on gateway port: 7522 - listen tcp 10.XXX.XXX.XXX:7522: bind: cannot assign requested address

My setting (the 10.XXX address is of type LoadBalancer):

gateway {
  name: " cluster-A"
  listen: "10.XXX.XXX.XXX:7522"
  gateways: [
    {
      name: " cluster-A"
      urls: ["10.XXX.XXX.XXX:7522", "10.XXX.XXX.XXX:7522", "10.XXX.XXX.XXX:7522" ]
    },
    {
      name: " cluster-B"
      urls: ["10.XXX.XXX.XXX:7522", "10.XXX.XXX.XXX:7522", "10.XXX.XXX.XXX:7522" ]
    },
    {
      name: " cluster-C"
      urls: ["10.XXX.XXX.XXX:7522", "10.XXX.XXX.XXX:7522", "10.XXX.XXX.XXX:7522" ]
    }
  ]
}

Is there something configured incorrectly?

And I think PR (#5356) should be able to solve the issue with the reconnection getting stuck.

kozlovic commented on June 8, 2024

@JohnTseng1012 We usually don't recommend placing load balancers between NATS servers or clients. Now that I understand that this address is the one from the load balancer, obviously the "listen" specification with this address won't work. Instead, specify "listen" with the IP address of this machine and use "advertise: 10.xxx" so that the load balancer address is the one sent to other servers, not the actual IP the server is listening on. Do that for all servers in the clusters.
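
As a rough sketch of that suggestion, the gateway block might look like the following. The "<local IP>" placeholder stands for an address actually assigned to the machine and is an assumption, as is reusing port 7522 and the masked 10.XXX addresses from the configuration posted above; only one remote gateway entry is shown.

```
gateway {
  name: "cluster-A"
  # Bind to an address that exists on this machine (placeholder).
  listen: "<local IP>:7522"
  # Address sent to other servers instead of the listen address:
  # here, the load balancer address.
  advertise: "10.XXX.XXX.XXX:7522"
  gateways: [
    {
      name: "cluster-B"
      urls: ["10.XXX.XXX.XXX:7522"]
    }
  ]
}
```

The split matters because the bind error above comes from trying to listen on an address the host does not own, while "advertise" only changes what is communicated to other servers.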

JohnTseng1012 commented on June 8, 2024

@kozlovic Is rebuilding the super cluster the only way to remove the internal private IPs from the IP list used by s.getRandomIP? After adding the advertise, I am still seeing internal private IPs. Is there any way to use only the advertise and the gateway URLs that I have configured myself?

kozlovic commented on June 8, 2024

The s.getRandomIP has nothing to do with this if you never specify a host name, just IPs.

So I have tested with both current main and back to v2.5.0 since this is the version you are using (you should upgrade, this is no longer supported). You don't actually have to set the "listen", but if you don't, by default the server will listen to 0.0.0.0 and get all interfaces and select one as the URL to send to its peers so that each server can "augment" the list of URLs this cluster can be reached at.

This is why you see (by printing the list of URLs before a server tries to connect) that there are some IPs that you consider internal (but they are non local from getNonLocalIPsIfHostIsIPAny() perspective).

When you later set the "advertise" config option to a "public" IP:port in, say, cluster1-server1 and restart that server, it will advertise this address to its peers, but the other servers will still have their "internal" IPs communicated to others. You need to make this update (adding advertise) to all servers in the first cluster and do a rolling update. Then move to the second cluster and do the same rolling update (that is, update a server and restart it, then move to the next), and finally to the third cluster. That may be enough for them all to clear the "internal" IPs from their lists, but it is possible that you need to do a further rolling restart of each cluster to fully clear them. Of course, if you can "afford" it, you could shut down all servers, do the config updates, then restart the whole super cluster.

Let me know if that helps resolve your issue and I will close this ticket. Thanks!

JohnTseng1012 commented on June 8, 2024

Thank you. After adding the advertise and doing a rolling restart of all clusters, the 172.XXX addresses no longer appear. Additionally, we have tested 2.10.15-RC, and it has resolved the issue of connections getting stuck. When will v2.10.15 be released?
