Code Monkey home page Code Monkey logo

Comments (15)

MathieuBordere avatar MathieuBordere commented on September 23, 2024

Thanks, taking a look.

update 1: Cannot reproduce on Ubuntu 22.04 host with 6.2 kernel running the test in an LXD sid container nor in a bookworm container.

from dqlite.

MathieuBordere avatar MathieuBordere commented on September 23, 2024

Aha, I can reproduce if I block my container from accessing the outside world. my container isn't assigned an IPv4 address (which is fortunately an issue I ran into by accident, but allows to reproduce this problem). If I disable my host firewall my LXD container is assigned an IPv4 address and the tests succeed. Not sure if this gives you something useful to work with ... don't know yet if this is a problem on our side.

from dqlite.

gibmat avatar gibmat commented on September 23, 2024

Ah, that's interesting, and reflects my setup. The containers I spin up for packaging work are on an IPv6-only network segment, although they do have both IPv4 and IPv6 addresses setup on the loopback interface. My bookworm machine has both addresses, and the QEMU VM only has IPv4 due to the simple NAT I configured.

I haven't looked into how the tests are trying to bind to addresses -- are they assuming there's an IPv4 address they can use on some "real" (non-loopback) interface?

from dqlite.

cole-miller avatar cole-miller commented on September 23, 2024

@gibmat Thanks for the report. Just for some context -- which release of libraft are you upgrading from, and do the same tests pass in the same containerized environment when run against that earlier version?

It's entirely possible that we have some IPv4-related implicit assumptions in our networking code; I'm looking into what the specific problem might be.

from dqlite.

cole-miller avatar cole-miller commented on September 23, 2024

So at least some of the failures here are caused by failing to resolve 127.0.0.1:9001, which is hardcoded for the tests here:

https://github.com/canonical/raft/blob/b68076fdc7d34a901cc2eb5eb480832c709b2532/test/lib/uv.h#L51

We could conceivably make it possible to run the tests using a different address/port.

(At least, I was able to reproduce the same pattern of test failures by doing ip addr del on the assigned IPv4 address inside a LXD container and running the test suite there; and I traced some of the test failures in that case to failing to a failed call to getaddrinfo against 127.0.0.1:9001 inside uvIpResolveBindAddresses.)

More generally, the code that implements the libuv TCP transport for raft is not IPv6-capable right now.

from dqlite.

cole-miller avatar cole-miller commented on September 23, 2024

The reason that 127.0.0.1 can't be resolved here despite the existence of the loopback interface is that we pass the AI_ADDRCONFIG flag to getaddrinfo.

from dqlite.

ganto avatar ganto commented on September 23, 2024

When trying to build the raft 0.18.0 for Fedora 37 or 38 via namespaced mock buildroot I'm facing the same behaviour as @gibmat. The tests remain stuck at:

  address=localhost:9000, bind-address=localhost:9000       [ ERROR ]
Error: test/integration/test_uv_tcp_listen.c:211: assertion failed: rv == 0 (18 == 0)
Error: child killed by signal 6 (Aborted)
  address=localhost:9000, bind-address=:9000                [ ERROR ]
Error: test/integration/test_uv_tcp_listen.c:211: assertion failed: rv == 0 (18 == 0)
Error: child killed by signal 6 (Aborted)
  address=localhost:9000, bind-address=0.0.0.0:9000         [ ERROR ]
Error: test/integration/test_uv_tcp_listen.c:211: assertion failed: rv == 0 (18 == 0)
Error: child killed by signal 6 (Aborted)
tcp_connect/closeDuringSecondConnect                        [ OK    ] [ 0.00023986 / 0.00029068 CPU ]
tcp_connect/closeDuringConnectAbort

The build environment only has a loopback device:

<mock-chroot> sh-5.2# ip a                                                                                                                                                                                                                                                                                                    
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000                                                                                                                                                                                                                                   
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00                                                                                                                                                                                                                                                                     
    inet 127.0.0.1/8 scope host lo                                                                                                                                                                                                                                                                                            
       valid_lft forever preferred_lft forever                                                                                                                                                                                                                                                                                
    inet6 ::1/128 scope host                                                                                                                                                                                                                                                                                                  
       valid_lft forever preferred_lft forever

The exact same build works fine for raft 0.17.1.

from dqlite.

cole-miller avatar cole-miller commented on September 23, 2024

Thanks for the report @ganto. I have an in-progress PR to rework how we call getaddrinfo in uv_ip.c that will hopefully enable running the test suite smoothly when there is no non-loopback IPv4 address available.

from dqlite.

MathieuBordere avatar MathieuBordere commented on September 23, 2024

The exact same build works fine for raft 0.17.1.

Can you confirm this please, because I can reproduce this behavior from 0.16.0 (incl.) onwards.

from dqlite.

cole-miller avatar cole-miller commented on September 23, 2024

Summary of where I think the problems are and how to approach fixing them:

  • There are two calls to getaddrinfo in the main codebase (note that uv_getaddrinfo just calls getaddrinfo on the thread pool):

    https://github.com/canonical/raft/blob/b68076fdc7d34a901cc2eb5eb480832c709b2532/src/uv_ip.c#L57-L85
    https://github.com/canonical/raft/blob/b68076fdc7d34a901cc2eb5eb480832c709b2532/src/uv_tcp_connect.c#L293-L294

    The first one is used to implement UvTcpListen, the second to implement UvTcpConnect.

  • In both cases, we pass the AI_ADDRCONFIG flag. The upshot of this is that on a system with no configured non-loopback IPv4 addresses, no IPv4 addresses will be returned from getaddrinfo. Because we pass AF_INET to getaddrinfo, no addresses will be returned at all in this case. This explains many of the test failures.

  • I am hesitant to remove AI_ADDRCONFIG from the getaddrinfo calls, because I don't fully understand the implications for downstream. Based on the man page it sounds like without this flag "bogus" addrinfos will be returned in some cases, and then raft would fail (fatally) trying to bind to those addresses.

  • Instead, I would propose that we upgrade the code in uv_ip.c to support parsing IPv6 + port pairs, so that people who want to run the tests can adapt the ones that rely on the network to listen on [::1]:9001 (etc.) if that's appropriate. We already have parsing code in dqlite that can be adapted for this.

  • Of course, we also need to pass something other than AF_INET to getaddrinfo for this to work. Just using AF_UNSPEC runs into problems because on a system that does have an IPv4 address configured, something like getaddrinfo("::1", "9001", AF_UNSPEC) will return both IPv4 and IPv6 addresses that conflict (EADDRINUSE), and on the server side we try to bind to every returned addrinfo. (Which is a design decision that we could revisit.)

  • Instead, we could pass either AF_INET or AF_INET6 to getaddrinfo depending on whether the pre-split address string uses the IPv6-specific syntax [addr]:port. This is backwards-compatible in the sense that both hostnames and IPv4 address strings passed into raft will behave just like they did before. We can continue using AI_V4MAPPED which I think should help with binding/connecting to IPv6 addresses on systems that only have IPv4 set up.

  • The test TCP server code in test/lib/tcp.c will also need updating along the lines I've suggested.

  • I'm not sure what to do about cases where we end up calling getaddrinfo(NULL). We could continue passing AF_INET in this case, which would be backwards-compatible, but then some of our tests will continue to fail. Alternatively, we could use AF_UNSPEC, and then we'd have to work out some alternative to binding to every single returned addrinfo. (Special-casing this makes some sense to me anyway.) I have not tested whether using AF_INET6 + AI_V4MAPPED here is a workable solution.

from dqlite.

cole-miller avatar cole-miller commented on September 23, 2024

As for patching the tests to work on an IPv6-only host (once the other problems in the main codebase have been fixed), it should be mostly a matter of replacing 127.0.0.1 by [::1] in various places; I don't have a comprehensive patch for that yet.

from dqlite.

ganto avatar ganto commented on September 23, 2024

Can you confirm this please, because I can reproduce this behavior from 0.16.0 (incl.) onwards.

Tested again on a Fedora 39 build root. 0.17.1 can still be built and tested fine. However because of canonical/raft#263 I'm still using this patch for that build and that I haven't adjusted it for 0.18.0 yet.

If I remove this patch the tests indeed also fail with 0.17.1.

from dqlite.

gibmat avatar gibmat commented on September 23, 2024

I tried to update dqlite's packaging in Debian to v1.16.4, but am encountering this same issue when trying to build with the now-bundled raft source. tcp_connect/closeDuringConnectAbort still hangs, and there are several additional test failures (due to output size, I'm not including the various failed tests):

180 of 233 (77%) tests successful, 11 (5%) test skipped.
FAIL raft-uv-integration-test (exit status: 1)

Trying to build with the current cowsql/raft fork results in an error:

libtool: link: gcc -std=c11 -g3 -fcf-protection --param=ssp-buffer-size=4 -pipe -fno-strict-aliasing -fdiagnostics-color -fexceptions -fstack-clash-protection -fstack-protector-strong -fasynchronous-unwind-tables -fdiagnostics-show-option -Wall -Wextra -Wimplicit-fallthrough=5 -Wcast-align -Wstrict-prototypes -Wlogical-op -Wmissing-include-dirs -Wold-style-definition -Winit-self -Wfloat-equal -Wsuggest-attribute=noreturn -Wformat=2 -Wshadow -Wendif-labels -Wdate-time -Wnested-externs -Wconversion -Werror -O2 -Wno-conversion -g -O2 -ffile-prefix-map=/build/dqlite-1.16.4=. -fstack-protector-strong -fstack-clash-protection -Wformat -Werror=format-security -fcf-protection -Wl,-z -Wl,relro -Wl,-z -Wl,now -o integration-test test/integration/integration_test-test_client.o test/integration/integration_test-test_cluster.o test/integration/integration_test-test_fsm.o test/integration/integration_test-test_membership.o test/integration/integration_test-test_node.o test/integration/integration_test-test_role_management.o test/integration/integration_test-test_server.o test/integration/integration_test-test_vfs.o test/integration/integration_test-main.o  ./.libs/libtest.a -luv -ldl -lrt -lpthread -lsqlite3 -lraft ./.libs/libdqlite.so -pthread -Wl,-rpath -Wl,/build/dqlite-1.16.4/.libs
/usr/bin/ld: ./.libs/libdqlite.so: undefined reference to `raft_register_state_cb'
collect2: error: ld returned 1 exit status

Possibly relevant versions: Debian sid LXD container with Debian kernel 6.1.0-18-amd64; libsqlite 3.45.1; libuv 1.48.0; liblz4 1.9.4.

from dqlite.

freeekanayaka avatar freeekanayaka commented on September 23, 2024

I tried to update dqlite's packaging in Debian to v1.16.4, but am encountering this same issue when trying to build with the now-bundled raft source. tcp_connect/closeDuringConnectAbort still hangs, and there are several additional test failures (due to output size, I'm not including the various failed tests):

180 of 233 (77%) tests successful, 11 (5%) test skipped.
FAIL raft-uv-integration-test (exit status: 1)

This could be fixed by applying a change like in cowsql/raft#79, see:

cowsql/raft@1a00ebf

in particular.

Trying to build with the current cowsql/raft fork results in an error:

libtool: link: gcc -std=c11 -g3 -fcf-protection --param=ssp-buffer-size=4 -pipe -fno-strict-aliasing -fdiagnostics-color -fexceptions -fstack-clash-protection -fstack-protector-strong -fasynchronous-unwind-tables -fdiagnostics-show-option -Wall -Wextra -Wimplicit-fallthrough=5 -Wcast-align -Wstrict-prototypes -Wlogical-op -Wmissing-include-dirs -Wold-style-definition -Winit-self -Wfloat-equal -Wsuggest-attribute=noreturn -Wformat=2 -Wshadow -Wendif-labels -Wdate-time -Wnested-externs -Wconversion -Werror -O2 -Wno-conversion -g -O2 -ffile-prefix-map=/build/dqlite-1.16.4=. -fstack-protector-strong -fstack-clash-protection -Wformat -Werror=format-security -fcf-protection -Wl,-z -Wl,relro -Wl,-z -Wl,now -o integration-test test/integration/integration_test-test_client.o test/integration/integration_test-test_cluster.o test/integration/integration_test-test_fsm.o test/integration/integration_test-test_membership.o test/integration/integration_test-test_node.o test/integration/integration_test-test_role_management.o test/integration/integration_test-test_server.o test/integration/integration_test-test_vfs.o test/integration/integration_test-main.o  ./.libs/libtest.a -luv -ldl -lrt -lpthread -lsqlite3 -lraft ./.libs/libdqlite.so -pthread -Wl,-rpath -Wl,/build/dqlite-1.16.4/.libs
/usr/bin/ld: ./.libs/libdqlite.so: undefined reference to `raft_register_state_cb'
collect2: error: ld returned 1 exit status

This is a known issue, and wouldn't be hard to fix, but since dqlite is now shipping with an embedded raft source code it's probably not worth it, unless somebody really needs that.

from dqlite.

gibmat avatar gibmat commented on September 23, 2024

Thanks!

from dqlite.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.