
Comments (18)

jacquelinekay commented on June 30, 2024

Thanks for reporting this. Can you post a minimal working example that compiles and link to it here (perhaps via Gist)?

from rclcpp.

firesurfer commented on June 30, 2024

Hey, unfortunately I don't have the time to create a minimal example at the moment. I'll try to provide one by the end of the week.

For now I made a small fix in rclcpp/client.hpp, in the function handle_response:

if (call_promise)  // only fulfill the promise if it is non-null
{
    call_promise->set_value(typed_response);
    callback(future);
}

With this fix everything works without any segfaults, but I don't think this should be the permanent fix.
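The failure mode behind that guard can be sketched in isolation. This is a hypothetical standalone model (the names pending_requests and handle_response mirror the rclcpp idea but are not the real implementation): a response whose sequence number has no matching pending request yields a null promise pointer, and fulfilling it without a check would dereference null.

```cpp
#include <cassert>
#include <future>
#include <map>
#include <memory>

// Hypothetical sketch, not the real rclcpp code: pending requests are
// tracked by sequence number; a response for an unknown (or already
// handled) sequence number yields a null promise.
std::map<int64_t, std::shared_ptr<std::promise<int>>> pending_requests;

bool handle_response(int64_t sequence_number, int typed_response)
{
  auto it = pending_requests.find(sequence_number);
  std::shared_ptr<std::promise<int>> call_promise =
    (it != pending_requests.end()) ? it->second : nullptr;
  if (call_promise) {  // the guard from the workaround above
    call_promise->set_value(typed_response);
    pending_requests.erase(it);
    return true;
  }
  return false;  // stale or unknown response: ignored instead of crashing
}
```

Without the if, a stale response would call set_value through a null pointer, which matches the observed segfault.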


firesurfer commented on June 30, 2024

Hey,
I've created a minimal demo at:
https://github.com/firesurfer/Segfault_demo

It consists of two parts: a segfault demo server and a segfault demo client. The segfault occurs in the client (test.cpp).


jacquelinekay commented on June 30, 2024

Thanks. I'll take a look at this today.


jacquelinekay commented on June 30, 2024

The segfault generalizes to services/clients; it's not specific to parameters (parameters are implemented on top of services/clients). I rewrote the test with a plain service and client and reproduced the segfault in ros2/system_tests#104.

My potential fix is in #193. It's not much more complicated than what you did, but I'm also not sure if that should be the permanent fix.


dirk-thomas commented on June 30, 2024

Based on @gerkey's investigation, I think we should first merge these two (ros2/rmw_connext#131, ros2/rmw_opensplice#112) once the Windows CI job turns over and is green. They have the potential to fix other issues (like 100% CPU load, etc.) too.


gerkey commented on June 30, 2024

After some investigation, with consultation from @jacquelinekay and @dirk-thomas, I believe the following things:

  • The proposed fix in #193 addresses the client-side segfault; with it applied, I haven't seen any client-side problems at all.
  • With the client-side problem fixed, this test case (ros2/system_tests#104) now exposes a server-side problem with services.
  • The timeout seen by Jenkins is due to the service server getting into a bad state in between the two service calls being made by the client.
  • The effect of the bad state is that wait() continuously returns the service-associated read condition as something that should be checked, but rmw_take_request() continuously fails to take any request data; this loop proceeds forever, causing spin_node_some() to never return.
  • The bad state is caused by something happening to the read condition associated with the service server when the first service client disconnects. I don't know where the bad state originates; it might be in OpenSplice, or in rmw_opensplice_cpp, or nowhere, in the sense that we should be handling an event in rmw_opensplice_cpp that we're currently not.
  • A related issue is that we're doing unnecessary work because we're not nulling service (and other) handles when wait() times out; @dirk-thomas addressed that in ros2/rmw_opensplice#112.
  • All of the above might also apply to other rmw implementations (e.g., ros2/rmw_connext#131).
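The wait/take loop described above can be modeled with a small standalone sketch. All names here are hypothetical stand-ins (FakeMiddleware, spin_some) for the rmw/rclcpp machinery, under the assumption described in the bullets: the read condition stays triggered but rmw_take_request never yields data, so the loop makes no progress. The sketch bounds the iterations so it terminates; in the real bug there is no bound, which is why spin_node_some() never returns.

```cpp
#include <cassert>

// Hypothetical model of the bad state: the service's read condition is
// permanently triggered, but taking a request always fails.
struct FakeMiddleware
{
  bool wait() {return true;}           // like rmw_wait: condition always "ready"
  bool take_request() {return false;}  // like rmw_take_request: nothing to take
};

// Returns the number of wasted wake-ups before giving up. Each iteration
// wakes from wait() with the condition set, tries to take a request, fails,
// and loops again: a busy loop doing no useful work.
int spin_some(FakeMiddleware & mw, int max_iterations)
{
  int wasted = 0;
  for (int i = 0; i < max_iterations; ++i) {
    if (!mw.wait()) {
      break;  // would normally block until something is ready
    }
    if (mw.take_request()) {
      return wasted;  // real work: a request was handled
    }
    ++wasted;  // woke up, but there was nothing to take
  }
  return wasted;
}
```

Every wake-up is wasted, which also explains the 100% CPU load symptom mentioned earlier in the thread.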


jacquelinekay commented on June 30, 2024

I'm still seeing a segfault on master with the Opensplice version of the test in ros2/system_tests#104, and an occasional deadlock with #193.


gerkey commented on June 30, 2024

@jacquelinekay That's what I'd expect: running ros2/system_tests#104 without #193 gives you a segfault (essentially the original bug reported here) and running it with #193 gives you a deadlock (caused by the "bad state" that the service server gets into upon client disconnection). @dirk-thomas's fix in ros2/rmw_opensplice#112 addresses a similar issue but does not get rid of the deadlock.


jacquelinekay commented on June 30, 2024

Got it; I thought you were suggesting those PRs would fix the issue, but you're saying the failure might be due to related issues.


gerkey commented on June 30, 2024

Yeah, there's definitely something still lurking in the management of service server read conditions when a client disconnects. We were able to demonstrate the problem last night with a single client that calls the service, then sleeps for several seconds, then exits. The service server looks good at first, responds correctly to the service call, still looks good, then goes bad immediately when the client exits. I'm hoping to look further into it today.


dirk-thomas commented on June 30, 2024

While this line listens for instances in any state:

https://github.com/ros2/rmw_opensplice/blob/5a56edf889f563d52be002258e6f4722cf9ddda2/rmw_opensplice_cpp/src/rmw_service.cpp#L120

that line only takes the alive ones:

https://github.com/ros2/rmw_opensplice/blob/5a56edf889f563d52be002258e6f4722cf9ddda2/rosidl_typesupport_opensplice_cpp/resource/srv__type_support.cpp.template#L91

The result is that once a sample with a non-alive instance state arrives, OpenSplice wakes up from rmw_wait immediately and forever, resulting in a busy loop. That loop happens within a single spin_* call in rclcpp.
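The mask mismatch described above can be shown with a minimal bitmask sketch. This is a simplified model (the enum values and helper names are illustrative, not the actual DDS API): the read condition is created to watch ANY instance state, but take() filters to ALIVE only, so a NOT_ALIVE sample (e.g. from a disconnected client) keeps triggering the condition without ever being consumed.

```cpp
#include <cassert>

// Illustrative instance-state bitmask, loosely modeled on DDS instance
// states; not the real OpenSplice constants.
enum InstanceState : unsigned
{
  ALIVE = 0x1,
  NOT_ALIVE = 0x2,
};
constexpr unsigned ANY_INSTANCE_STATE = ALIVE | NOT_ALIVE;

// A condition or take operation matches a sample when its mask overlaps
// the sample's instance state.
bool mask_matches(unsigned mask, unsigned sample_state)
{
  return (mask & sample_state) != 0;
}
```

Under this model, a NOT_ALIVE sample satisfies the wide condition mask but never the narrow ALIVE take mask, reproducing the busy loop; widening the take mask to ANY_INSTANCE_STATE, as suggested below, removes the mismatch.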


gerkey commented on June 30, 2024

So changing to ANY_INSTANCE_STATE in the take() call will fix it?


dirk-thomas commented on June 30, 2024

Hopefully yes, running CI jobs for it right now...


dirk-thomas commented on June 30, 2024

New CI jobs:


dirk-thomas commented on June 30, 2024

Creating a second client after the first one is destroyed fails with Connext dynamic on all platforms. Looking into it...


dirk-thomas commented on June 30, 2024

New CI jobs (including ros2/rmw_connext#134):


tfoote commented on June 30, 2024

From the meeting: merge this as-is with the return after a warning. Add a TODO, and after the alpha re-escalate it to an exception so we can find it if it happens in the future.

