Code Monkey home page Code Monkey logo

Comments (8)

GoesM avatar GoesM commented on May 27, 2024 1

I found that I have some misunderstanding before. I'm so sorry for my fault.

I've deleted the additional information before, and here's a more clear one.

Additional information

code analysis

To clearly understand why use-after-free within multi-threads, it's first to find which thread / function use and which thread free.

the thread for use
It's obvious that the pointer is used in the function BtActionServer::executeCallback()->CompletionCallback() (also named as onCompletion()) -> stopNavigating()

And the function BtActionServer::executeCallback() is called in another thread by the bt_action_server_'s member action_server_ , refer to following code:

https://github.com/ros-planning/navigation2/blob/0897da9252b4cb537b1110e88831a1e63ac157d3/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server_impl.hpp#L146-L152

the thread for free

however, there's only action_server_.reset() in bt_action_server_->on_cleanup(), but no wait for action_server_'s sub-thread finished.

https://github.com/ros-planning/navigation2/blob/0897da9252b4cb537b1110e88831a1e63ac157d3/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server_impl.hpp#L203-L215

Also, there's only bt_action_server_->oncleanup() in nav2_bt_navigator->on_cleanup(), but also no wait for action_server_'s sub-thread finished.

https://github.com/ros-planning/navigation2/blob/0897da9252b4cb537b1110e88831a1e63ac157d3/nav2_bt_navigator/src/bt_navigator.cpp#L175-L193

result
So that if action_server_'s sub-thread is still running after action_server_.reset() or even after bt_action_server_->on_cleanup() or even after nav2_bt_navigator->on_cleanup() or even after ~nav2_bt_navigator(), the sub-thread would access the freed pointers of nav2_bt_navigator (then UAF occurs).

experiments / POC

To make sure the executeCallback() is still working after the ~nav2_bt_navigator(), I insert some code into executeCallback() as following to extend the time-cost of sub-thread:

template<class ActionT>
void BtActionServer<ActionT>::executeCallback()
{
...
    // insert
      std::this_thread::sleep_for(std::chrono::seconds(4));
    // insert end
  on_completion_callback_(result, rc);
...
}

then launch nav2 normally

then publish a goal_pose like :

ros2 topic pub /goal_pose geometry_msgs/msg/PoseStamped " 
header:
  stamp:
    sec: 1706255138
    nanosec: 811244537
  frame_id: map
pose:
  position:
    x: 0.7312569618225098
    y: 1.8531221151351929
    z: 0.0
  orientation:
    x: 0.0
    y: 0.0
    z: -0.7723305100496971
    w: 0.635220893269715" -1

then Ctrl+C to nav2 before goal-achieved

then asan-report would occur.

It proves that ** the executeCallback() is still working after the ~nav2_bt_navigator() **

similar ticket

I believe it's quite simliar to another UAF-issue #4166, they both just do a shared-pointer reset() but no wait for the sub-thread / sub-process created by the shared-pointer, which brings in the UAF.

suggestion to fix

Is there any way to let the sub-thread / process created by action_server_ joint into the function bt_action_server_->on_cleanup() or nav2_bt_navigator->on_cleanup() ?

from navigation2.

SteveMacenski avatar SteveMacenski commented on May 27, 2024

It is not considered a valid shutdown mid-execution. You should wait between executions, then bring down the lifecycle nodes, then exit the program cleanly. With that said, if you want to improve cleanup using normal-operations signals, power to you!

server_timeout_ the spin timeout is there to handle so we can handle lack of responses. Though, it seems odd to me looking at it now that in those if statements, we don't return after the log.

You could also adjust the definition of should_cancel_goal() https://github.com/ros-planning/navigation2/blob/47374622dee01a27e5f9b8ae08f3d19a15de9b3a/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_node.hpp#L338-L356 to look at the context / rclcpp::ok() to not cancel in case of halt() in the shutdown case, thereby bypassing all of this.

from navigation2.

SteveMacenski avatar SteveMacenski commented on May 27, 2024

I don't have an immediate answer, it could be many things I'd need to catch myself to see where it happened - but if we handle it the way I describe above, maybe its moot :-)

from navigation2.

SteveMacenski avatar SteveMacenski commented on May 27, 2024

Spin thread appears to be false, so the simple action server isn't using its own executor, so I suppose the first thing to find is where that new thread is created to decide where we want to make sure its joined on the action server's reset().

reset() would destroy the object's members, including the executor and related assets

  rclcpp::CallbackGroup::SharedPtr callback_group_{nullptr};
  rclcpp::executors::SingleThreadedExecutor::SharedPtr executor_;
  std::unique_ptr<nav2_util::NodeThread> executor_thread_;

And as we've shown before, the NodeThread's destructor join()s the thread before exiting. So that all seems fine to me. Finding where that thread is the important part. I don't see that the action server has a new thread - but not that there is one from the client node https://github.com/ros-planning/navigation2/blob/0897da9252b4cb537b1110e88831a1e63ac157d3/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server_impl.hpp#L127 for the internals of the BT nodes. I leave it to your sleuthing to find where the issue thread is at to join.

from navigation2.

GoesM avatar GoesM commented on May 27, 2024

Conclusion :

UAF is caused by the error thrown in function SimpleActionServer::deactivate() in [nav2_uti/include/nav2_util/simple_action_server.hpp] of nav2_humble

https://github.com/ros-planning/navigation2/blob/a45b151ceb3aa9edebad8a528cd672935f0c668d/nav2_util/include/nav2_util/simple_action_server.hpp#L289-L296

if here's some error thrown here, it would step in try..catch mechanism of ros2/rclcpp, refer to code link

node_interfaces::LifecycleNodeInterface::CallbackReturn
LifecycleNode::LifecycleNodeInterfaceImpl::execute_callback(
  unsigned int cb_id, const State & previous_state) const
{
  // in case no callback was attached, we forward directly
  auto cb_success = node_interfaces::LifecycleNodeInterface::CallbackReturn::SUCCESS;


  auto it = cb_map_.find(static_cast<uint8_t>(cb_id));
  if (it != cb_map_.end()) {
    auto callback = it->second;
    try {
      cb_success = callback(State(previous_state));
    } catch (const std::exception & e) {
      RCLCPP_ERROR(
        node_logging_interface_->get_logger(),
        "Caught exception in callback for transition %d", it->first);
      RCLCPP_ERROR(
        node_logging_interface_->get_logger(),
        "Original error: %s", e.what());
      cb_success = node_interfaces::LifecycleNodeInterface::CallbackReturn::ERROR;
    }
  }
  return cb_success;
}

Following that, it interrupts the process's original deactivate() function, causing the process to exit without completing the task of waiting for the thread to exit, resulting in a UAF.

from navigation2.

GoesM avatar GoesM commented on May 27, 2024

SO, it seems that we should solve error-thrown here, if we want to avoid UAF.

error thrown 1 : the lack of nullptr-check bring in bad function call error catched

in nav2-main stack , here's a nullptr-check for completion_callback_ function-pointer, which is used to avoid nullptr-accessed

https://github.com/ros-planning/navigation2/blob/09ba08b7f45004df92f77fa683ea33d8afa2acce/nav2_util/include/nav2_util/simple_action_server.hpp#L317

However, here's the lack of it in nav2-humble version:

https://github.com/ros-planning/navigation2/blob/a45b151ceb3aa9edebad8a528cd672935f0c668d/nav2_util/include/nav2_util/simple_action_server.hpp#L293

BUT, according to the code of bt_navigator, this pointer is always NULL, so that NullPtr-accessed would be catched as "bad function call".

debug info for error thrown 1

I insert some log here for debug, as following:

 while (execution_future_.wait_for(milliseconds(100)) != std::future_status::ready) {
      info_msg("Waiting for async process to finish.");
      if (steady_clock::now() - start_time >= server_timeout_) {
           // goes insert
              warn_msg("terminate_all() started. ");
           // insert end
        terminate_all();
          // goes insert
            warn_msg("terminate_all() finished. ");
            warn_msg("completion_callback_ start ");
          // insert end
      if(completion_callback_) completion_callback_();
          // goes insert
            warn_msg("completion_callback_ finished.");
            warn_msg("before throw error. ");
          // insert end
        throw std::runtime_error("Action callback is still running and missed deadline to stop");
      }
    debug_msg("Deactivation completed.");
  }

then, the log of nav2_bt_navigator shows:

[INFO] [1710427380.141387057] [rclcpp]: signal_handler(signum=2)
[INFO] [1710427380.141517732] [bt_navigator]: Running Nav2 LifecycleNode rcl preshutdown (bt_navigator)
[INFO] [1710427380.141572676] [bt_navigator]: Deactivating
[WARN] [1710427380.141646003] [bt_navigator]: [navigate_to_pose] [ActionServer] Requested to deactivate server but goal is still executing. Should check if action server is running before deactivating.
[ERROR] [1710427380.171597874] [bt_navigator_navigate_to_pose_rclcpp_node]: Failed to get result for follow_path in node halt!
[INFO] [1710427380.241779538] [bt_navigator]: [navigate_to_pose] [ActionServer] Waiting for async process to finish.
[INFO] [1710427380.341846408] [bt_navigator]: [navigate_to_pose] [ActionServer] Waiting for async process to finish.
[INFO] [1710427380.442045833] [bt_navigator]: [navigate_to_pose] [ActionServer] Waiting for async process to finish.
[INFO] [1710427380.542353693] [bt_navigator]: [navigate_to_pose] [ActionServer] Waiting for async process to finish.
[INFO] [1710427380.642514433] [bt_navigator]: [navigate_to_pose] [ActionServer] Waiting for async process to finish.
[WARN] [1710427380.642575567] [bt_navigator]: [navigate_to_pose] [ActionServer] terminate_all() started. 
[WARN] [1710427380.642596713] [bt_navigator]: [navigate_to_pose] [ActionServer] Aborting handle.
[WARN] [1710427381.642985458] [bt_navigator]: [navigate_to_pose] [ActionServer] terminate_all() finished. 
[WARN] [1710427381.642994046] [bt_navigator]: [navigate_to_pose] [ActionServer] completion_callback_ start 
[ERROR] [1710427381.643055223] [bt_navigator]: Caught exception in callback for transition 14
[ERROR] [1710427381.643063858] [bt_navigator]: Original error: bad_function_call
[WARN] [1710427381.643086517] [bt_navigator]: Error occurred while doing error handling.
[FATAL] [1710427381.643113948] [bt_navigator]: Lifecycle node bt_navigator does not have error state implemented
[INFO] [1710427381.643134721] [bt_navigator]: Destroying bond (bt_navigator) to lifecycle manager.

As we can see, the log is interrupted after completion_callback_ start and before completion_callback_ finished, which proves that the interruption is caused by the nullptr-accessed here.

error thrown 2 : the proactive "throw error" leads to error catched

To avoid NullPtr-accessed, I fixed as how it's fixed in nav2-main:

 while (execution_future_.wait_for(milliseconds(100)) != std::future_status::ready) {
      info_msg("Waiting for async process to finish.");
      if (steady_clock::now() - start_time >= server_timeout_) {
           // goes insert
              warn_msg("terminate_all() started. ");
           // insert end
        terminate_all();
          // goes insert
            warn_msg("terminate_all() finished. ");
            warn_msg("completion_callback_ start ");
          // insert end
     if(completion_callback_) completion_callback_();
          // goes insert
            warn_msg("completion_callback_ finished.");
            warn_msg("before throw error. ");
          // insert end
        throw std::runtime_error("Action callback is still running and missed deadline to stop");
      }
    debug_msg("Deactivation completed.");
  }

BUT, the UAF still occurs. That because the code throw an runtime_error here, also catched by rclcpp:

https://github.com/ros-planning/navigation2/blob/a45b151ceb3aa9edebad8a528cd672935f0c668d/nav2_util/include/nav2_util/simple_action_server.hpp#L294

debug info for error thrown 2

we can see the log is as following:

[INFO] [1710427133.548628103] [rclcpp]: signal_handler(signum=2)
[INFO] [1710427133.548761756] [bt_navigator]: Running Nav2 LifecycleNode rcl preshutdown (bt_navigator)
[INFO] [1710427133.548870713] [bt_navigator]: Deactivating
[WARN] [1710427133.548921523] [bt_navigator]: [navigate_to_pose] [ActionServer] Requested to deactivate server but goal is still executing. Should check if action server is running before deactivating.
[ERROR] [1710427133.572469539] [bt_navigator_navigate_to_pose_rclcpp_node]: Failed to get result for follow_path in node halt!
[INFO] [1710427133.649079516] [bt_navigator]: [navigate_to_pose] [ActionServer] Waiting for async process to finish.
[INFO] [1710427133.749291894] [bt_navigator]: [navigate_to_pose] [ActionServer] Waiting for async process to finish.
[INFO] [1710427133.849653475] [bt_navigator]: [navigate_to_pose] [ActionServer] Waiting for async process to finish.
[INFO] [1710427133.949895585] [bt_navigator]: [navigate_to_pose] [ActionServer] Waiting for async process to finish.
[INFO] [1710427134.050078388] [bt_navigator]: [navigate_to_pose] [ActionServer] Waiting for async process to finish.
[WARN] [1710427134.050137301] [bt_navigator]: [navigate_to_pose] [ActionServer] terminate_all() started. 
[WARN] [1710427134.050147568] [bt_navigator]: [navigate_to_pose] [ActionServer] Aborting handle.
[WARN] [1710427135.050402705] [bt_navigator]: [navigate_to_pose] [ActionServer] terminate_all() finished. 
[WARN] [1710427135.050417258] [bt_navigator]: [navigate_to_pose] [ActionServer] before throw error. 
[ERROR] [1710427135.050471087] [bt_navigator]: Caught exception in callback for transition 14
[ERROR] [1710427135.050495290] [bt_navigator]: Original error: Action callback is still running and missed deadline to stop
[WARN] [1710427135.050514797] [bt_navigator]: Error occurred while doing error handling.
[FATAL] [1710427135.050531012] [bt_navigator]: Lifecycle node bt_navigator does not have error state implemented
[INFO] [1710427135.050548761] [bt_navigator]: Destroying bond (bt_navigator) to lifecycle manager.

it shows that the runtime_error is catched by rclcpp, and no final debug_msg Deactivation completed. occurs, which proves that the processes is also interrupted by try-catch of rclcpp

from navigation2.

GoesM avatar GoesM commented on May 27, 2024

suggestion to fix ?

1> add nullptr-check in nav2_humble

2> just warn but no throw error ?

I'm not very clear about the details of why there's a proactive "throw error" approach and why there's a “try-catch” mechanism in rclcpp, but I believe they are set for mitigating the harm caused by errors.

However, if such methods end up introducing more harmful UAF-type bugs, personally, I think it might be better for us not to proactively "throw errors" but just warn_msg() it for the user ?

Since, if user wrongly shutdown the navigation when some tasks are still on-work , it should be better to show a warning for his own operation than bring a more harmful security-issue like UAF ? I think....

from navigation2.

SteveMacenski avatar SteveMacenski commented on May 27, 2024

Please submit a PR! I think a strongly worded error message would be sufficient as well

from navigation2.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.