
Comments (6)

pqarmitage commented on July 4, 2024

Can you please provide the output of ip addr show eth0 on both systems.

If this doesn't help identify the cause of the problem, I'll provide details of how to enable the various debug options within keepalived.

from keepalived.

stanluk commented on July 4, 2024

Sorry for the late answer; we recently had some problems reproducing the issue.
Below is the same issue with a slightly different config than the one previously attached: both nodes start as MASTER with the same priority. I know this is an anti-pattern, but as far as I tested locally, keepalived was able to recover from such a wrong config. Moreover, the number of reloads was forced to be higher than in the previous run.

The full logs from the system run:

$hostname0: ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
3: eth0@if22956: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default 
    link/ether 0a:58:0a:81:02:64 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.129.2.100/23 brd 10.129.3.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fd01:0:0:3::598b/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::858:aff:fe81:264/64 scope link 
       valid_lft forever preferred_lft forever
4: net1@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 2a:90:6d:9a:a4:1b brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.168.1.36/24 brd 172.168.1.255 scope global net1
       valid_lft forever preferred_lft forever
    inet 10.10.10.2/24 scope global net1
       valid_lft forever preferred_lft forever
    inet6 fe80::2890:6dff:fe9a:a41b/64 scope link 
       valid_lft forever preferred_lft forever
5: net2@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 6e:21:6f:cf:da:91 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.168.0.4/24 scope global net2
       valid_lft forever preferred_lft forever
    inet 192.168.120.2/24 scope global net2
       valid_lft forever preferred_lft forever
    inet6 fe80::6c21:6fff:fecf:da91/64 scope link 
       valid_lft forever preferred_lft forever

$hostname0: ss
Netid State  Recv-Q Send-Q     Local Address:Port Peer Address:PortProcess
???   UNCONN 0      0           0.0.0.0%eth0:vrrp      0.0.0.0:*          
???   UNCONN 0      0      10.129.2.100%eth0:vrrp      0.0.0.0:*          

$hostname0: cat /tmp/keepalived.conf
global_defs {
  vrrp_startup_delay 10.0
  vrrp_garp_interval 0.001
  vrrp_version 3
  vrrp_garp_master_refresh 30
  vrrp_garp_lower_prio_repeat 5
  vrrp_higher_prio_send_advert true
  script_user root root
  notify_fifo /tmp/notify_fifo
  notify_fifo_script /tmp/notify.sh
}
vrrp_script check_masterability {
  script "/cmds -run check-master"
  interval 1
  timeout 1
  rise 1
  fall 1
}
vrrp_script check_masterability_on_active {
  script "/cmds -run check-master-on-active"
  interval 1
  timeout 1
  rise 2
  fall 3
}

track_file drop_master {
  file "/config/drop_master"
  weight 0
  init_file 0
}

vrrp_instance VI_1 {
  advert_int 0.4
  interface eth0
  state MASTER
  unicast_src_ip 10.129.2.100
  unicast_peer {
     10.131.0.83
  }
  virtual_router_id 1
  priority 255
  virtual_ipaddress {
    192.168.120.2/24 dev net2
    10.10.10.2/24 dev net1
  }
  virtual_routes {
  }
  track_script {
    check_masterability
    check_masterability_on_active
  }
  track_interface {
    net1
    net2
  }
  track_file {
    drop_master
  }
  notify_master "/cmds -run on-master"
}

$hostname0: tcpdump proto 112
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
07:31:50.435733 IP 10.131.0.83 > svc-tcp-service-01-0: VRRPv3, Advertisement, (ttl 254), vrid 1, prio 255, intvl 40cs, length 16
07:31:50.531809 IP svc-tcp-service-01-0 > 10.131.0.83: VRRPv3, Advertisement, vrid 1, prio 255, intvl 40cs, length 16
07:31:50.835950 IP 10.131.0.83 > svc-tcp-service-01-0: VRRPv3, Advertisement, (ttl 254), vrid 1, prio 255, intvl 40cs, length 16
07:31:50.931995 IP svc-tcp-service-01-0 > 10.131.0.83: VRRPv3, Advertisement, vrid 1, prio 255, intvl 40cs, length 16
07:31:51.236236 IP 10.131.0.83 > svc-tcp-service-01-0: VRRPv3, Advertisement, (ttl 254), vrid 1, prio 255, intvl 40cs, length 

$hostname1: ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
3: eth0@if18234: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default 
    link/ether 0a:58:0a:83:00:53 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.131.0.83/23 brd 10.131.1.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fd01:0:0:5::4714/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::858:aff:fe83:53/64 scope link 
       valid_lft forever preferred_lft forever
4: net1@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether d6:cc:d7:68:3a:f3 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.168.1.48/24 brd 172.168.1.255 scope global net1
       valid_lft forever preferred_lft forever
    inet 10.10.10.2/24 scope global net1
       valid_lft forever preferred_lft forever
    inet6 fe80::d4cc:d7ff:fe68:3af3/64 scope link 
       valid_lft forever preferred_lft forever
5: net2@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 96:6c:04:e2:d5:28 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.168.0.3/24 scope global net2
       valid_lft forever preferred_lft forever
    inet 192.168.120.2/24 scope global net2
       valid_lft forever preferred_lft forever
    inet6 fe80::946c:4ff:fee2:d528/64 scope link 
       valid_lft forever preferred_lft forever

$hostname1: ss
Netid State  Recv-Q Send-Q    Local Address:Port Peer Address:PortProcess
???   UNCONN 0      0          0.0.0.0%eth0:vrrp      0.0.0.0:*          
???   UNCONN 0      0      10.131.0.83%eth0:vrrp      0.0.0.0:*          

$hostname1: cat /tmp/keepalived.conf
global_defs {
  vrrp_startup_delay 10.0
  vrrp_garp_interval 0.001
  vrrp_version 3
  vrrp_garp_master_refresh 30
  vrrp_garp_lower_prio_repeat 5
  vrrp_higher_prio_send_advert true
  script_user root root
  notify_fifo /tmp/notify_fifo
  notify_fifo_script /tmp/notify.sh
}
vrrp_script check_masterability {
  script "/cmds -run check-master"
  interval 1
  timeout 1
  rise 1
  fall 1
}
vrrp_script check_masterability_on_active {
  script "/cmds -run check-master-on-active"
  interval 1
  timeout 1
  rise 2
  fall 3
}

track_file drop_master {
  file "/config/drop_master"
  weight 0
  init_file 0
}

vrrp_instance VI_1 {
  advert_int 0.4
  interface eth0
  state MASTER
  unicast_src_ip 10.131.0.83
  unicast_peer {
     10.129.2.100
  }
  virtual_router_id 1
  priority 255
  virtual_ipaddress {
    192.168.120.2/24 dev net2
    10.10.10.2/24 dev net1
  }
  virtual_routes {
  }
  track_script {
    check_masterability
    check_masterability_on_active
  }
  track_interface {
    net1
    net2
  }
  track_file {
    drop_master
  }
  notify_master "/cmds -run on-master"
}

$hostname1: tcpdump proto 112
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
07:31:12.814505 IP svc-tcp-service-01-1 > 10.129.2.100: VRRPv3, Advertisement, vrid 1, prio 255, intvl 40cs, length 16
07:31:12.910542 IP 10.129.2.100 > svc-tcp-service-01-1: VRRPv3, Advertisement, (ttl 254), vrid 1, prio 255, intvl 40cs, length 16
07:31:13.214688 IP svc-tcp-service-01-1 > 10.129.2.100: VRRPv3, Advertisement, vrid 1, prio 255, intvl 40cs, length 16
07:31:13.310866 IP 10.129.2.100 > svc-tcp-service-01-1: VRRPv3, Advertisement, (ttl 254), vrid 1, prio 255, intvl 40cs, length 16
07:31:13.615045 IP svc-tcp-service-01-1 > 10.129.2.100: VRRPv3, Advertisement, vrid 1, prio 255, intvl 40cs, length 16
07:31:13.711201 IP 10.129.2.100 > svc-tcp-service-01-1: VRRPv3, Advertisement, (ttl 254), vrid 1, prio 255, intvl 40cs, length 16
07:31:14.015266 IP svc-tcp-service-01-1 > 10.129.2.100: VRRPv3, Advertisement, vrid 1, prio 255, intvl 40cs, length 16
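
For reference, in both captures each node is still sending advertisements with prio 255 roughly every 400 ms, so neither has backed off. RFC 5798 (section 6.4.3) says a master that receives an advertisement of equal priority should yield when the sender's primary address is greater than its own. Below is a minimal sketch of that comparison in C. The names master_should_yield and ipv4 are hypothetical, chosen for illustration; this is not keepalived code, and keepalived's real handling may differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not a keepalived function.
 * Build a host-order IPv4 address from its four octets. */
uint32_t ipv4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return ((uint32_t)a << 24) | ((uint32_t)b << 16) | ((uint32_t)c << 8) | d;
}

/* Hypothetical helper, not a keepalived function.
 * Returns true when a node in MASTER state that receives an advertisement
 * should fall back to BACKUP, following the rule in RFC 5798 section 6.4.3.
 * Addresses are IPv4 in host byte order. */
bool master_should_yield(uint8_t local_prio, uint32_t local_addr,
                         uint8_t adv_prio, uint32_t adv_addr)
{
    /* A strictly higher priority in the advertisement always wins. */
    if (adv_prio > local_prio)
        return true;
    /* Equal priority: the sender with the greater primary address wins. */
    return adv_prio == local_prio && adv_addr > local_addr;
}
```

With the addresses from this run, master_should_yield(255, ipv4(10,129,2,100), 255, ipv4(10,131,0,83)) is true, so under that rule the node at 10.129.2.100 would be the one expected to yield, and not the other way round.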

I even checked with strace, and it seems that it processes the packets:

strace: Process 89 attached
sendmsg(14, {msg_name={sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.131.0.83")}, msg_namelen=16, msg_iov=[{iov_base="E\300\0$\17;\0\0\377p\0\0\n\201\2d\n\203\0S1\1\377\2\0(j\341\300\250x\2"..., iov_len=36}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 36
recvmsg(13, {msg_name={sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.131.0.83")}, msg_namelen=28 => 16, msg_iov=[{iov_base="E\300\0$\17<\0\0\376p\224\263\n\203\0S\n\201\2d1\1\377\2\0(j\341\300\250x\2"..., iov_len=1400}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_CTRUNC|MSG_TRUNC) = 36
recvmsg(13, {msg_namelen=16}, MSG_CTRUNC|MSG_TRUNC) = -1 EAGAIN (Resource temporarily unavailable)
sendmsg(14, {msg_name={sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.131.0.83")}, msg_namelen=16, msg_iov=[{iov_base="E\300\0$\17<\0\0\377p\0\0\n\201\2d\n\203\0S1\1\377\2\0(j\341\300\250x\2"..., iov_len=36}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 36
recvmsg(13, {msg_name={sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.131.0.83")}, msg_namelen=28 => 16, msg_iov=[{iov_base="E\300\0$\17=\0\0\376p\224\262\n\203\0S\n\201\2d1\1\377\2\0(j\341\300\250x\2"..., iov_len=1400}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_CTRUNC|MSG_TRUNC) = 36

strace: Process 87 attached
recvmsg(13, {msg_name={sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.129.2.100")}, msg_namelen=28 => 16, msg_iov=[{iov_base="E\300\0$\16r\0\0\376p\225}\n\201\2d\n\203\0S1\1\377\2\0(j\341\300\250x\2"..., iov_len=1400}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_CTRUNC|MSG_TRUNC) = 36
recvmsg(13, {msg_namelen=16}, MSG_CTRUNC|MSG_TRUNC) = -1 EAGAIN (Resource temporarily unavailable)
sendmsg(14, {msg_name={sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.129.2.100")}, msg_namelen=16, msg_iov=[{iov_base="E\300\0$\16s\0\0\377p\0\0\n\203\0S\n\201\2d1\1\377\2\0(j\341\300\250x\2"..., iov_len=36}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 36
recvmsg(13, {msg_name={sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.129.2.100")}, msg_namelen=28 => 16, msg_iov=[{iov_base="E\300\0$\16s\0\0\376p\225|\n\201\2d\n\203\0S1\1\377\2\0(j\341\300\250x\2"..., iov_len=1400}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_CTRUNC|MSG_TRUNC) = 36
recvmsg(13, {msg_namelen=16}, MSG_CTRUNC|MSG_TRUNC) = -1 EAGAIN (Resource temporarily unavailable)
sendmsg(14, {msg_name={sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.129.2.100")}, msg_namelen=16, msg_iov=[{iov_base="E\300\0$\16t\0\0\377p\0\0\n\203\0S\n\201\2d1\1\377\2\0(j\341\300\250x\2"..., iov_len=36}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 36
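
The iov_base strings can be read by hand as a 20-byte IPv4 header followed by the VRRPv3 advertisement. The sketch below decodes the first sendmsg buffer of process 89, using only the 32 bytes that strace printed (the remainder is elided by strace). The struct and function names are made up for illustration, and the field layout is taken from RFC 5798, not from keepalived's source.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative summary of the fields of interest; not a keepalived type. */
struct vrrp3_summary {
    uint8_t  ip_ttl, ip_proto;
    uint8_t  src[4], dst[4];
    uint8_t  version, type, vrid, priority, addr_count;
    uint16_t adver_int_cs;   /* Max Adver Int, in centiseconds */
};

/* First 32 bytes of the sendmsg() payload shown by strace for process 89,
 * with the octal escapes rewritten as hex. */
const uint8_t sample_advert[] = {
    0x45, 0xC0, 0x00, 0x24, 0x0F, 0x3B, 0x00, 0x00,   /* ver/IHL, TOS, len 36, id */
    0xFF, 0x70, 0x00, 0x00,                           /* TTL 255, proto 112, csum */
    0x0A, 0x81, 0x02, 0x64,                           /* src 10.129.2.100 */
    0x0A, 0x83, 0x00, 0x53,                           /* dst 10.131.0.83 */
    0x31, 0x01, 0xFF, 0x02, 0x00, 0x28, 0x6A, 0xE1,   /* VRRPv3 header */
    0xC0, 0xA8, 0x78, 0x02                            /* first VIP 192.168.120.2 */
};

/* Decode an IPv4 packet carrying a VRRPv3 advertisement.
 * Returns 0 on success, -1 if the buffer is too short or not IPv4. */
int vrrp3_decode(const uint8_t *buf, size_t len, struct vrrp3_summary *out)
{
    if (len < 20 || (buf[0] >> 4) != 4)
        return -1;
    size_t ihl = (size_t)(buf[0] & 0x0F) * 4;    /* IPv4 header length in bytes */
    if (ihl < 20 || len < ihl + 8)               /* need the 8-byte VRRP header */
        return -1;

    out->ip_ttl = buf[8];
    out->ip_proto = buf[9];
    memcpy(out->src, buf + 12, 4);
    memcpy(out->dst, buf + 16, 4);

    const uint8_t *v = buf + ihl;
    out->version = v[0] >> 4;
    out->type = v[0] & 0x0F;
    out->vrid = v[1];
    out->priority = v[2];
    out->addr_count = v[3];
    /* Top 4 bits of byte 4 are reserved; the low 12 bits are the interval. */
    out->adver_int_cs = (uint16_t)(((v[4] & 0x0F) << 8) | v[5]);
    return 0;
}
```

Decoding sample_advert gives protocol 112, TTL 255, VRRP version 3, type 1 (advertisement), vrid 1, priority 255, two addresses, and an interval of 40 centiseconds, which agrees with the tcpdump lines. The received copies carry TTL 254 (\376), matching the "(ttl 254)" that tcpdump reports for the peer's packets.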

The keepalived logs are available at:
https://gist.github.com/stanluk/cc828b1f99a2f4734f609501eaa8c4ab

amoy-xuhao commented on July 4, 2024

Is there any progress on this issue? I am also encountering the same problem in my Kubernetes cluster.

pqarmitage commented on July 4, 2024

I think this is probably caused by reloading keepalived before the vrrp_startup_delay has expired. Looking in vrrp_dispatcher_read() in vrrp_scheduler.c, there are the following lines of code:

                if (vrrp_delayed_start_time.tv_sec)
                        continue;

which means that any packet received before the start delay timer expires is discarded. However, when the reload occurs before the delay timer expires, the timer thread that cancels the timer is removed, and so the timer never expires.
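
To make the failure mode concrete, here is a toy model in C. None of it is keepalived code: the struct, the function names and the rearm flag are invented for illustration, and the rearm variant only shows one conceivable remedy, which is not necessarily what the eventual patch does. The only part taken from the source is the shape of the check, dropping packets while the delayed start time is non-zero.

```c
#include <stdbool.h>
#include <time.h>

/* Toy model of the startup delay; all names here are hypothetical. */
struct toy_vrrp {
    struct timespec delayed_start_time; /* non-zero tv_sec: still delaying */
    bool timer_armed;                   /* a timer exists that will clear it */
    unsigned accepted, dropped;
};

/* Begin the startup delay and schedule the timer that ends it. */
void toy_start(struct toy_vrrp *v, time_t now, time_t delay)
{
    v->delayed_start_time.tv_sec = now + delay;
    v->delayed_start_time.tv_nsec = 0;
    v->timer_armed = true;
    v->accepted = v->dropped = 0;
}

/* Timer callback: ends the delay, but only if the timer still exists. */
void toy_timer_fire(struct toy_vrrp *v)
{
    if (!v->timer_armed)
        return;
    v->delayed_start_time.tv_sec = 0;
    v->timer_armed = false;
}

/* Reload: pending timers are discarded. With rearm == false the delay flag
 * is left set with nothing to clear it, which models the reported problem.
 * With rearm == true the timer is scheduled again (an assumed remedy). */
void toy_reload(struct toy_vrrp *v, bool rearm)
{
    v->timer_armed = false;
    if (rearm && v->delayed_start_time.tv_sec)
        v->timer_armed = true;
}

/* Mirrors the quoted check: discard packets while the delay is active. */
bool toy_receive(struct toy_vrrp *v)
{
    if (v->delayed_start_time.tv_sec) {
        v->dropped++;
        return false;
    }
    v->accepted++;
    return true;
}
```

In this model, calling toy_start, then toy_reload with rearm set to false, then toy_timer_fire leaves toy_receive returning false indefinitely, so every advertisement from the peer is ignored. That would be consistent with both nodes continuing to act as master.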

I will continue investigating, and submit a patch later today.

pqarmitage commented on July 4, 2024

I was able to reproduce this problem, and it was indeed caused by reloading keepalived before the startup_delay timer had expired.

Commit 58483b2 resolves this issue. Many apologies for the long delay in resolving this, but I hadn't previously realised the significance of the startup delay.

stanluk commented on July 4, 2024

@pqarmitage thanks for investigating this and providing a patch!
