
Comments (13)

eturck commented on July 22, 2024

Another update, probably the last one, but a mostly positive one ;).
Well, we finally had time to try running navitia with boost 1.55 instead of boost 1.62 in production.

We started by re-enabling disruptions on 1.62; there were still some crashes even with the hardware changes.
We then installed boost 1.55 alongside the current 1.62, built a kraken against it, built new nav files with an ed2nav built against boost 1.55, and enabled disruptions on March 18th. Since then we've loaded close to a thousand disruptions and... no crash whatsoever.
It's impossible to say whether it's 100% fixed, but we were crashing pretty quickly before (a few dozen disruptions before a crash, at best).
It will probably stay that way in production, and we will start taking disruptions into account in our journeys soon.

We searched the boost issue tracker and mailing list with @pbench, but nothing resembled this crash; I really don't know what could cause it.
The next step would probably be to test this with a recent boost version, as staying on boost 1.55 forever will not be very practical.

If you want to close this issue, feel free to do so. There will probably be no changes on our side on this for quite a while.

Thank you everyone for your ideas and your help!


nonifier commented on July 22, 2024

Hi @eturck,

Sorry to hear about all the crashes; it's quite embarrassing, to be honest.

Looking at the call stack you provided reminds me of a memory corruption problem we had in the past while working on some optimisations in georef. We were corrupting the data, but could only see the problem later, when the data was used during serialization.

The root cause and the actual crash were completely unrelated...

This brings me to a few questions:

  • Are you using the exact version of Navitia, without any modifications?
  • Have you tried running your kraken with any of the Address/Memory Sanitizers (https://github.com/google/sanitizers)? (We should really take some time to integrate those tools into our CI though... see the sketch after this list.)
  • Do you have the same crash rate on other versions?
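
For reference, a minimal sketch (a hypothetical example, not navitia code) of the kind of bug AddressSanitizer pinpoints at the first invalid access, instead of at a later, unrelated crash site:

#include <vector>

// Build with: g++ -fsanitize=address -g -O1 use_after_free.cpp
int main() {
    std::vector<int> v(3);
    int* p = v.data();   // pointer into the vector's current buffer
    v.resize(100);       // forces a reallocation, freeing the old buffer
    return p[0];         // ASan reports a heap-use-after-free right here,
                         // with the allocation, free, and use stacks
}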

Funnily enough, when we last saw that type of crash on our platform, we had to bisect on our production servers to root-cause the issue... Not a glorious time at all :(

Sorry I couldn't help you much, but I hope this gives you a few leads.
Olivier.


nonifier commented on July 22, 2024

I've asked the team to retrieve the version that had the issue I mentioned... Hopefully this might be 2.87 🤞


xlqian commented on July 22, 2024

We once had a random crash on our prod caused by an uninitialized int in A*. Here is the PR which fixed it: #2846

Note that this PR was merged in v2.82.
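
For illustration (a hypothetical snippet, not the actual navitia A* code or the fix from that PR), this is the class of bug, and exactly what MemorySanitizer is designed to flag: reading an uninitialized member is undefined behaviour, so it can appear to work for months and then crash at random:

struct AStarLabel {
    int cost;                      // BUG: never initialized
};

bool cheaper(const AStarLabel& a, const AStarLabel& b) {
    return a.cost < b.cost;        // reads an indeterminate value: undefined behaviour
}

// A fix of that kind simply gives the member a deterministic value:
// struct AStarLabel { int cost = 0; };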

As @nonifier suggested, if you're short of ideas about what's happening in the code, you may have to bisect to find the very PR that caused the crash. Let us know!


eturck commented on July 22, 2024

Sorry for the delay, I was away for the last few days.

@nonifier:

  • We are using release v2.87 https://github.com/CanalTP/navitia/releases/tag/v2.87.0 with this tiny patch from v3.2.0: #3039. Before that we were using release v2.69 without any patches. However, we do have custom modules and a custom scenario in jormungandr (nothing too crazy in terms of kraken requests, I think).
  • We have not tried much, since we really had no idea what to do :).
  • We already had crashes on v2.69, at approximately the same rate, on the instance that doesn't crash much. No idea for the other one, since we only just activated disruptions on it. On our test instance we have never had any crashes...

@xlqian: since this was merged in v2.82 we already have that fix, so it must be something else :(.

I guess we are going to need to resort to bisecting.
The problems are:

  • We currently don't know how to reproduce this crash... and since we only have one kraken in production, playing with it is not an option.
  • The only version that was OK was the v2.34 we had before, but it was running on Debian 7 with boost 1.55, so there is another layer of complexity in identifying the problem.

Our main idea is to launch Gatling or something like it while updating disruptions at the same time, to see if we can make this happen semi-reliably enough to make a git bisect worthwhile.


pbougue commented on July 22, 2024

I'd suggest plugging a local memory-sanitized version of navitia into your prod disruption queue and sending it a few requests on the side...
The problem would be the noise in the sanitizer log (probably unrelated warnings and maybe some errors), but that's the lead I would try.


eturck commented on July 22, 2024

I'd suggest plugging a local memory-sanitized version of navitia into your prod disruption queue and sending it a few requests on the side...
The problem would be the noise in the sanitizer log (probably unrelated warnings and maybe some errors), but that's the lead I would try.

Thanks @pbougue for the suggestion. I don't know much about memory sanitizers, but I'll try them out and see if I can find something.


pbougue commented on July 22, 2024

You're welcome, it's a group effort here :)
Maybe have a look at https://github.com/CanalTP/navitia/search?q=sanitize&type=Issues; you could find some leads there to ease the setup.


eturck commented on July 22, 2024

Well, I finally dived into sanitizers, but I'm not sure whether I found something or not...
I started by building a local navitia on the same version (2.87), plugged into our test Chaos server. I mainly tried to issue journey requests at the time of the data cloning.

The potentially good news is that I got 3 separate crashes during data cloning while running kraken compiled with AddressSanitizer. The sad part is that I only managed it twice within the first half hour, and not once in the five or six hours of trying after that... This morning I managed a third one (also at the start of the day), and not once since, either.

I don't know if I did something wrong, but there is basically no information in there whatsoever... Here is the gist of it: https://gist.github.com/eturck/2ee7c7107f9eee98c362f7700f5ef5ca I fear I did something wrong, since @pbougue warned me about a potentially large amount of noise.
The first stack trace doesn't really match what we had in production, even though it ends in libboost_serialization; the second one seems to match a bit more, and the third crash was the same as the second.
However, the end result is different: stack trace 1 is a segfault and stack trace 2 a std::invalid_argument. In production we always get a SIGABRT from a std::bad_alloc.

I also ran ThreadSanitizer, since the number of requests seemed to play a role in our production crashes.
I get a lot of data races, even between two standard threads, so I don't know what is a false positive and what isn't.
But I saved the result for the data race between a journey request and the thread in which the data is copied: https://gist.github.com/eturck/ee0dfcfba885f3d3f6cf710981fba3b4
There are a few occurrences of "SUMMARY: ThreadSanitizer: data race (/srv/navitia/kraken/kraken+0xcd7c20) in memcpy":

WARNING: ThreadSanitizer: data race (pid=18493)
  Read of size 8 at 0x7dc40003c000 by thread T3 (mutexes: write M44502):
    #0 memcpy <null> (kraken+0x000000cd7c20)
    #1 threadbuf::sync() <null> (kraken+0x0000013549a8)
    #2 boost::archive::basic_binary_iprimitive<boost::archive::binary_iarchive, char, std::char_traits<char> >::basic_binary_iprimitive(std::basic_streambuf<char, std::char_traits<char> >&, bool) <null> (kraken+0x000000fa5c95)
    #3 boost::archive::binary_iarchive_impl<boost::archive::binary_iarchive, char, std::char_traits<char> >::binary_iarchive_impl(std::istream&, unsigned int) <null> (kraken+0x000000fa58fb)
    #4 navitia::type::Data::clone_from(navitia::type::Data const&) <null> (kraken+0x000000f9d3ce)
    #5 DataManager<navitia::type::Data>::get_data_clone() <null> (kraken+0x000000dadcda)
    #6 navitia::MaintenanceWorker::handle_rt_in_batch(std::vector<boost::shared_ptr<AmqpClient::Envelope>, std::allocator<boost::shared_ptr<AmqpClient::Envelope> > > const&) <null> (kraken+0x000000da7dd3)
    #7 navitia::MaintenanceWorker::listen_rabbitmq() <null> (kraken+0x000000da9a2b)
    #8 navitia::MaintenanceWorker::operator()() <null> (kraken+0x000000da921b)
    #9 boost::detail::thread_data<navitia::MaintenanceWorker>::run() <null> (kraken+0x000000d8b6ec)
    #10 boost::detail::make_ready_at_thread_exit(boost::shared_ptr<boost::detail::shared_state_base>) <null> (libboost_thread.so.1.62.0+0x000000012115)

  Previous write of size 8 at 0x7dc40003c000 by thread T6:
    #0 memcpy <null> (kraken+0x000000cd7c20)
    #1 std::basic_streambuf<char, std::char_traits<char> >::xsputn(char const*, long) <null> (libstdc++.so.6+0x00000011c52a)
    #2 boost::archive::basic_binary_oarchive<boost::archive::binary_oarchive>::init() <null> (kraken+0x000001014aee)
    #3 boost::archive::binary_oarchive_impl<boost::archive::binary_oarchive, char, std::char_traits<char> >::binary_oarchive_impl(std::ostream&, unsigned int) <null> (kraken+0x0000010145e8)
    #4 std::thread::_State_impl<std::_Bind_simple<navitia::type::Data::clone_from(navitia::type::Data const&)::$_3 ()> >::_M_run() /home/tisseo/navitia/source/type/data.cpp (kraken+0x000000f9db66)
    #5 std::error_code::default_error_condition() const <null> (libstdc++.so.6+0x0000000b9e6e)

I guess it could be related? But I fear we are way (way) past my knowledge of multithreaded applications...
I'm more than ready to dive a bit deeper into this, but I need some guidance. Do you feel there is something there that needs to be investigated, or not?

I'll be on Element in the afternoon to talk about all of this.

Thank you for your help.


pbench commented on July 22, 2024

Hi @eturck !

Thanks for your detailed investigation!

Some thoughts about this:
As @kinnou02 mentioned on Element, this may be related to the data clone trick.
Indeed, Data::clone_from appears in the stack traces you obtained.
The data clone trick seems to be adapted from a stackoverflow answer.

Some differences between the stackoverflow answer and the navitia code:

  • in the stackoverflow answer there are three threads: the main one, a writer, and a reader. Moreover, there is an sbuf.close(); between the end of the write and the end of the read.
  • in navitia, there are only two threads: the writer, and the main one, which acts as the reader. The sbuf.close(); has disappeared.

Reading the navitia code, I don't really understand why it works (or maybe I'm reading it wrongly). As I understand it:

  • we launch a write thread
  • we read from the main thread
  • when reading is done, we wait for the completion of the write thread

So as I see it, the main thread may run the reading operation to completion before the write thread has finished writing, which I think could result in the data races found by ThreadSanitizer.
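
A minimal sketch of that pattern as I read it (simplified and hypothetical, not the actual Data::clone_from code; threadbuf stands for the synchronized stream buffer from the stackoverflow answer, and Data for navitia::type::Data):

#include <istream>
#include <ostream>
#include <thread>
#include <boost/archive/binary_iarchive.hpp>
#include <boost/archive/binary_oarchive.hpp>

void clone_sketch(const Data& source, Data& clone) {
    threadbuf sbuf;                          // buffer shared by both threads

    // 1. launch a write thread
    std::thread writer([&] {
        std::ostream os(&sbuf);
        boost::archive::binary_oarchive oa(os);
        oa << source;                        // serialize the current data
        // note: no sbuf.close() here, unlike the stackoverflow answer
    });

    // 2. read from the main thread
    std::istream is(&sbuf);
    boost::archive::binary_iarchive ia(is);
    ia >> clone;                             // deserialize into the new data

    // 3. when reading is done, wait for the write thread
    writer.join();
}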

Maybe you can try to modify navitia to bring it more in line with the stackoverflow answer (see the sketch after this list):

  • launch a read thread instead of the current write one
  • launch the write from the main thread
  • apply the sbuf.close(); after the write operation
  • finally, join on the read thread

Or even completely reproduce the stackoverflow answer with 3 threads.
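
A sketch of that reordering (again hypothetical and simplified, under the same assumptions as the sketch above):

void clone_sketch_reordered(const Data& source, Data& clone) {
    threadbuf sbuf;

    // 1. launch a read thread instead of the current write one
    std::thread reader([&] {
        std::istream is(&sbuf);
        boost::archive::binary_iarchive ia(is);
        ia >> clone;
    });

    // 2. launch the write from the main thread
    {
        std::ostream os(&sbuf);
        boost::archive::binary_oarchive oa(os);
        oa << source;
    }   // the oarchive is destroyed here, flushing its stream

    // 3. apply sbuf.close() after the write, as in the stackoverflow
    //    answer, so the reader sees a proper end-of-stream
    sbuf.close();

    // 4. finally, join on the read thread
    reader.join();
}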


eturck commented on July 22, 2024

Thanks for the feedback, I really appreciate the time you all took to take a look at this.

I can't say I really understand the code either; I had the same questions about the order of execution. I saw that @TeXitoi was the one responsible for this https://github.com/CanalTP/navitia/pull/632/files#diff-930d109fe1ca9c60e82996bda4dd2f8aR538 :). If he ever stumbles upon this message and wants to shed some light on those lines...

I was really hopeful about this data race, but unfortunately I wasn't able to reproduce it a second time... I had a few others more or less related to data cloning, and sometimes there were tens of data races in totally unrelated parts too. Lots of noise, but never this one again.

I spent three days trying a few things to find a way to at least semi-reliably reproduce this crash or data race. Even if it happened only once an hour, that would be frequent enough to maybe draw a conclusion about a fix, but right now there is just nothing, and I have no clear path for further investigation.

My current plan is:

  • Re-enable disruptions in production. We have had a few hardware changes (faster CPU and storage); I want to see whether they make any difference, and confirm that there is still a problem, before trying anything else.
  • Try to run kraken in the exact same environment but with boost 1.55 instead of boost 1.62. The crash is always in boost_serialization, we didn't have the issue on Debian 7 / boost 1.55, and you don't have it on Debian 8 / boost 1.55 either... It's a shot in the dark, but it is my best lead on this.
  • Cry a lot if nothing works 😭


pbench commented on July 22, 2024

To check whether the data race comes from this data clone trick, maybe it is possible to assign a low priority to the write thread, so that it gets scheduled after the read in the main thread?


pbench commented on July 22, 2024

Hey, @pascal-vs suggested an easy way to test whether the data clone trick is responsible for the crashes: add a sleep() in the write thread before serializing to the archive. This should ensure that the read occurs before the write, and thus trigger a segfault; see the snippet below.
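
In the write thread of the sketches above, that would look something like this (hypothetical placement; assumes <thread> and <chrono> are included):

std::thread writer([&] {
    // added delay: the main thread's read now reliably starts first
    std::this_thread::sleep_for(std::chrono::seconds(5));
    std::ostream os(&sbuf);
    boost::archive::binary_oarchive oa(os);
    oa << source;
});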

