Comments (30)

BenQLange avatar BenQLange commented on August 11, 2024 2

I don't think we can write a try, except block for floating point exceptions or assertion errors. I tried and it was still killing the worker and stopping the script.

Instead, I have iterated through the dataset with the above configs and created a dictionary of failing files (bash script with a loop until it finished iterating through a dataset). For now, I just skip those files during training.
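
The crash is a signal (SIGABRT or SIGFPE) raised in native code, so it ends the interpreter before any Python except clause can run. That is why the scan has to happen outside the training process. Below is a minimal sketch of such a scan written in Python instead of bash. DEFAULT_CHILD is a hypothetical stand-in that only opens the file; in practice it would build the dataloader sample for that scenario.

```python
import signal
import subprocess
import sys

# Hypothetical child program (an assumption, not part of Nocturne): replace the
# body with whatever loads one scenario, using the path passed in sys.argv[1].
DEFAULT_CHILD = "import sys; open(sys.argv[1]).close()"


def crashed_by_signal(path, child_code=DEFAULT_CHILD, timeout=600):
    """Run child_code on path in a fresh interpreter.

    Returns the name of the terminating signal, or None if the child exited
    on its own (even with a Python exception).
    """
    proc = subprocess.run(
        [sys.executable, "-c", child_code, str(path)],
        capture_output=True,
        timeout=timeout,
    )
    # On POSIX a negative return code means "killed by signal -returncode".
    if proc.returncode < 0:
        return signal.Signals(-proc.returncode).name
    return None


def find_failing_files(paths, child_code=DEFAULT_CHILD):
    """Map each crashing file to the signal that killed its child process."""
    failing = {}
    for path in paths:
        sig = crashed_by_signal(path, child_code)
        if sig is not None:
            failing[str(path)] = sig
    return failing
```

The resulting dictionary can be saved and used as a skip list. Starting one interpreter per file is slow, but it isolates each crash to a single file.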

from nocturne.

eugenevinitsky avatar eugenevinitsky commented on August 11, 2024 1

Oh that's super useful that you can reproduce it without the training! So it's in the worker or possibly in Nocturne itself...
I'll try to reproduce it using the smaller dataset but otherwise it'll be a few days until my new laptop arrives and I can do some analysis on the full dataset, sorry!

BenQLange avatar BenQLange commented on August 11, 2024 1

I have enabled the debug option in setup.py. Now I am getting the following errors:

python: /home/bernard.lange/nocturne/nocturne/cpp/include/geometry/line_segment.h:41: nocturne::geometry::Vector2D nocturne::geometry::LineSegment::Point(float) const: Assertion `t >= 0.0f && t <= 1.0f' failed.

Thread 1 "python" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

python: /home/bernard.lange/nocturne/nocturne/cpp/include/geometry/polygon.h:67: nocturne::geometry::ConvexPolygon::ConvexPolygon(const std::initializer_list<nocturne::geometry::Vector2D>&): Assertion `VerifyVerticesOrder()' failed.

Thread 1 "python" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6)
    at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

I am not a C++ wizard. Is it possible that those assertion errors lead to the floating point exception?

eugenevinitsky avatar eugenevinitsky commented on August 11, 2024 1

I think that's probably it; great job and thank you!! @xiaomengy (our C++ wizard) do you see how this error could occur? We could definitely use your insight here

eugenevinitsky avatar eugenevinitsky commented on August 11, 2024 1

Hey @BenQLange, just to give you an update we're slightly backlogged but Xiaomeng will take a look at this on Tuesday. Figured it was better to have a time than persistent uncertainty

xiaomengy avatar xiaomengy commented on August 11, 2024 1

Hi @BenQLange. Sorry for being late because of some other deadlines. I will take a detailed look later today and hopefully resolve it ASAP.

eugenevinitsky avatar eugenevinitsky commented on August 11, 2024 1

Thanks for finding those! We are still looking into it but in the meantime would including a try, except block in your code temporarily resolve this issue so that you aren't blocked? We should have a resolution soon.

BenQLange avatar BenQLange commented on August 11, 2024 1

Modified dataset resolves the assertion errors but I am still experiencing floating point exceptions from time to time :(

eugenevinitsky avatar eugenevinitsky commented on August 11, 2024 1

Hmm, we are still looking into it. I just got a new laptop with enough space for the whole dataset so hopefully I can reconstruct your errors and help.

eugenevinitsky avatar eugenevinitsky commented on August 11, 2024 1

Oh! Okay, let me throw on the debug flag and try again. Thanks for the suggestion.

xiaomengy avatar xiaomengy commented on August 11, 2024 1

Hi @BenQLange. Just to let you know about our progress: there is at least one vehicle/object with a negative length in tfrecord-00008-of-01000_364.json, which at least explains the assertion failure. We are now investigating why such values exist and will try to find a way to handle these cases.

We found an object with shape "width": 4.4137163162231445, "length": -1.295910358428955 in tfrecord-00008-of-01000_364.json.
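
A rough sketch of how a scenario file could be scanned for such objects follows. The "width" and "length" fields are the ones quoted above. The top-level "objects" key is an assumption about the JSON layout, so adjust it if the files differ.

```python
import json


def invalid_objects(scenario, key="objects"):
    """Return (index, width, length) for objects with non-positive dimensions.

    `key` is an assumed name for the list of objects in the scenario JSON.
    """
    bad = []
    for i, obj in enumerate(scenario.get(key, [])):
        width, length = obj.get("width"), obj.get("length")
        if width is None or length is None or width <= 0 or length <= 0:
            bad.append((i, width, length))
    return bad


def scan_file(path):
    """Load one scenario JSON file and list its invalid objects."""
    with open(path) as f:
        return invalid_objects(json.load(f))
```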

eugenevinitsky avatar eugenevinitsky commented on August 11, 2024 1

We're following up with Waymo here waymo-research/waymo-open-dataset#542 and will hopefully find some resolution (though the floating point error is probably from a different source).

eugenevinitsky avatar eugenevinitsky commented on August 11, 2024

Ooof; thank you for catching and reporting this. We have never seen this.

A few questions to see if anything differs between your setup and ours.
Question 1: Are you training on the mini-dataset or the full dataset?
Question 2: Are you using all the files or a subset of them, i.e. are you modifying the value of num_files in the config?

Then, a reproducibility step:

  1. Is there any chance you can get the dataloader to print the scenario_path when this happens? This is a value defined in the _get_waymo_iterator here (https://github.com/facebookresearch/nocturne/blob/main/examples/imitation_learning/waymo_data_loader.py). Seeing this might help us investigate the right file and find it faster.
  2. Could you print the state and action values, on the off chance that this shows whether the problem lies in a state or an action?
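
The two steps above can be sketched roughly as follows. Both helper names are hypothetical and not part of the dataloader. The point is to flush each path before the file is processed, so the last line printed before a hard crash names the scenario involved.

```python
import math
import sys


def announce(paths, stream=sys.stderr):
    """Yield each scenario path, logging it first with an explicit flush."""
    for path in paths:
        print(f"scenario_path: {path}", file=stream, flush=True)
        yield path


def non_finite(values):
    """Return the indices of NaN or infinite entries in a flat sequence."""
    return [i for i, v in enumerate(values) if not math.isfinite(v)]
```

Wrapping the list of files in announce(...) and calling non_finite(...) on the flattened state and action values would show both the file and whether either array contains NaN or inf.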

eugenevinitsky avatar eugenevinitsky commented on August 11, 2024

Once I know if it's the mini or full-dataset and how many files you are using, I'll run the dataloader over the relevant files and see if we can find the file where this error occurs.

BenQLange avatar BenQLange commented on August 11, 2024

Sounds good. I am using the full dataset with num_files set to -1 (entire dataset). I'll let you know when I have the file name.

eugenevinitsky avatar eugenevinitsky commented on August 11, 2024

Thanks for that info!

@xiaomengy I'm going to write a quick script tomorrow to search through the dataset and build samples from the dataloader and return any files if they throw an error. Would you be able to run it on the cluster and send me any file-names it flags?

eugenevinitsky avatar eugenevinitsky commented on August 11, 2024

One last question, does it ever train to completion or is this blocking you from completing any training run? Just trying to get a sense of how rare it is.

BenQLange avatar BenQLange commented on August 11, 2024

It does train to completion most often. It fails 20%ish of the time

eugenevinitsky avatar eugenevinitsky commented on August 11, 2024

Great, that's useful information.

BenQLange avatar BenQLange commented on August 11, 2024

It's weird. It's not caused by a specific file. Sometimes it iterates through all files with no issue, sometimes it crashes :(

eugenevinitsky avatar eugenevinitsky commented on August 11, 2024

Well, it's interesting: the error is raised in the call to the distribution, so I'm wondering whether this is not a file error at all, but the model producing a NaN somewhere between passing the state through the head and passing that output to the MultivariateNormal distribution. It seems to be complaining that a value and its transpose are not close? Since the model training runs serially, you could put a breakpoint inside a try/except block and see what is being passed when that method errors.
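
As an illustration of what to look for at that breakpoint, here is a pure-Python sketch that checks a covariance matrix (given as nested lists) for non-finite entries and for asymmetry. The function name and tolerances are assumptions chosen for the example. With tensors, a check such as torch.isfinite on the head output would be the more natural tool.

```python
import math


def covariance_problems(cov, rel_tol=1e-5, abs_tol=1e-8):
    """List non-finite entries and asymmetric pairs in a square matrix."""
    problems = []
    n = len(cov)
    for i in range(n):
        for j in range(n):
            if not math.isfinite(cov[i][j]):
                problems.append(f"non-finite entry at ({i}, {j}): {cov[i][j]}")
            elif (
                j > i
                and math.isfinite(cov[j][i])
                and not math.isclose(cov[i][j], cov[j][i],
                                     rel_tol=rel_tol, abs_tol=abs_tol)
            ):
                # Only the upper triangle is compared, so each pair is reported once.
                problems.append(
                    f"asymmetric at ({i}, {j}): {cov[i][j]} vs {cov[j][i]}"
                )
    return problems
```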

I'll try to help more but I have yet to reproduce the issue on my local machine (admittedly, training is slow). Will be faster once I get access to a cluster again.

eugenevinitsky avatar eugenevinitsky commented on August 11, 2024

Ah! One more thing that @nathanlct pointed out: are you using Discrete actions or Continuous actions? We've only extensively tested the discrete setting; perhaps the precision / covariance matrix is acting up.

from nocturne.

BenQLange avatar BenQLange commented on August 11, 2024

I don't think it's related to the call to distribution. It happens for both action and position action spaces. When I just iterate through the dataset in a simple script, I sometimes (but not always?) get a floating point exception. I only have screenshots of the traceback (sorry).
It's really confusing.

[Screenshot: Screen Shot 2022-07-28 at 10 30 14 AM]

BenQLange avatar BenQLange commented on August 11, 2024

This is the backtrace with gdb when it fails:

Thread 1 "python" received signal SIGFPE, Arithmetic exception.
0x00007fff4a29cb85 in ?? ()
   from /home/bernard.lange/nocturne/nocturne_cpp.cpython-38-x86_64-linux-gnu.so
(gdb) bt
#0  0x00007fff4a29cb85 in ?? ()
   from /home/bernard.lange/nocturne/nocturne_cpp.cpython-38-x86_64-linux-gnu.so
#1  0x00007fff4a2b08ab in ?? ()
   from /home/bernard.lange/nocturne/nocturne_cpp.cpython-38-x86_64-linux-gnu.so
#2  0x00007fff4a2ab29a in ?? ()
   from /home/bernard.lange/nocturne/nocturne_cpp.cpython-38-x86_64-linux-gnu.so
#3  0x00007fff4a2ad5dd in ?? ()
   from /home/bernard.lange/nocturne/nocturne_cpp.cpython-38-x86_64-linux-gnu.so
#4  0x00007fff4a27dd2a in ?? ()
   from /home/bernard.lange/nocturne/nocturne_cpp.cpython-38-x86_64-linux-gnu.so
#5  0x00007fff4a274a98 in ?? ()
   from /home/bernard.lange/nocturne/nocturne_cpp.cpython-38-x86_64-linux-gnu.so
#6  0x00007fff4a269c5d in ?? ()
   from /home/bernard.lange/nocturne/nocturne_cpp.cpython-38-x86_64-linux-gnu.so
#7  0x000055555569000e in cfunction_call_varargs ()
    at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/call.c:743
#8  0x000055555568513f in _PyObject_MakeTpCall ()
    at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/call.c:159
#9  0x00005555556bacba in _PyObject_Vectorcall (kwnames=0x0, nargsf=3, args=0x7fffffffd2d0, 
    callable=0x7fff4a6e83b0)
    at /opt/conda/conda-bld/python-split_1648465063888/work/Include/cpython/abstract.h:125
#10 method_vectorcall ()
    at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/classobject.c:89
#11 0x000055555568b20d in PyVectorcall_Call (kwargs=0x0, tuple=0x7fff4a50b840, 
    callable=0x7fffa512b7c0)
    at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/call.c:200
#12 PyObject_Call () at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/call.c:228
#13 0x000055555562f9cb in slot_tp_init (self=0x7fff43099cf0, args=0x7fff4a50b840, kwds=0x0)
    at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/typeobject.c:6793
#14 0x000055555568ff27 in type_call ()
    at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/typeobject.c:994
#15 0x00007fffeec764b9 in pybind11_meta_call ()
   from /home/bernard.lange/miniconda3/envs/nocturne/lib/python3.8/site-packages/torch/lib/libtorch_python.so
#16 0x000055555568513f in _PyObject_MakeTpCall ()
    at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/call.c:159
#17 0x000055555572f89f in _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, 
    args=0x55555856e4d8, callable=0x555558b66080)
    at /opt/conda/conda-bld/python-split_1648465063888/work/Include/cpython/abstract.h:125
#18 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, 
    tstate=0x5555558f3ff0)
    at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:4963
#19 _PyEval_EvalFrameDefault ()
    at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:3500
#20 0x00005555557210ff in PyEval_EvalFrameEx (throwflag=0, f=0x55555856e240)
    at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:741
#21 _PyEval_EvalCodeWithName ()
    at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:4298
#22 0x0000555555721bc4 in _PyFunction_Vectorcall ()
    at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/call.c:436
#23 0x000055555572b0bb in _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, 
    args=0x7ffff6f6a5b8, callable=0x7ffff6fd0310)
   8/work/Include/cpython/abstract.h:127
#24 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x5555558f3ff0)
    at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:4963
#25 _PyEval_EvalFrameDefault () at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:3500
#26 0x0000555555720600 in PyEval_EvalFrameEx (throwflag=0, f=0x7ffff6f6a440)
    at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:741
#27 _PyEval_EvalCodeWithName () at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:4298
#28 0x0000555555721eb3 in PyEval_EvalCodeEx (closure=0x0, kwdefs=0x0, defcount=0, defs=0x0, kwcount=0, kws=0x0, argcount=0, 
    args=0x0, locals=<optimized out>, globals=<optimized out>, _co=<optimized out>)
    at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:4327
#29 PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>)
    at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:718
#30 0x0000555555796622 in run_eval_code_obj () at /opt/conda/conda-bld/python-split_1648465063888/work/Python/pythonrun.c:1166
#31 0x00005555557a71d2 in run_mod () at /opt/conda/conda-bld/python-split_1648465063888/work/Python/pythonrun.c:1188
#32 0x00005555557aa36b in pyrun_file () at /opt/conda/conda-bld/python-split_1648465063888/work/Python/pythonrun.c:1085
#33 0x00005555557aa54f in pyrun_simple_file (flags=0x7fffffffdb08, closeit=1, filename=0x7ffff6e8b4b0, fp=0x55555596b500)
    at /opt/conda/conda-bld/python-split_1648465063888/work/Python/pythonrun.c:439
#34 PyRun_SimpleFileExFlags () at /opt/conda/conda-bld/python-split_1648465063888/work/Python/pythonrun.c:472
#35 0x00005555557aaa29 in pymain_run_file (cf=0x7fffffffdb08, config=0x5555558f3020)
    at /opt/conda/conda-bld/python-split_1648465063888/work/Modules/main.c:391
#36 pymain_run_python (exitcode=0x7fffffffdb00) at /opt/conda/conda-bld/python-split_1648465063888/work/Modules/main.c:616
#37 Py_RunMain () at /opt/conda/conda-bld/python-split_1648465063888/work/Modules/main.c:695
#38 0x00005555557aac29 in Py_BytesMain () at /opt/conda/conda-bld/python-split_1648465063888/work/Modules/main.c:1127
#39 0x00007ffff703fc87 in __libc_start_main (main=0x55555565bea0 <main>, argc=2, argv=0x7fffffffdcf8, init=<optimized out>, 
    fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffdce8) at ../csu/libc-start.c:310
#40 0x000055555574dad7 in _start ()

Does it tell you anything about the root cause?
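
gdb typically prints ?? when it cannot resolve a symbol for a frame, so the native part of this trace says little on its own. A complementary option is Python's standard faulthandler module, which writes the Python-level stack to stderr when the process receives SIGFPE, SIGABRT, SIGSEGV and similar signals. The sketch below shows the behaviour in a child process, with os.abort() standing in for the crashing native call.

```python
import subprocess
import sys

# Child program: enable faulthandler, then crash on purpose.
CHILD = """
import faulthandler, os
faulthandler.enable()   # dump the Python stack on fatal signals
os.abort()              # stand-in for the crashing native call
"""


def run_with_faulthandler(child_code=CHILD):
    """Run the child and return (return code, captured stderr text)."""
    proc = subprocess.run(
        [sys.executable, "-c", child_code],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stderr
```

In the real script, a single faulthandler.enable() call at the top (or running with python -X faulthandler) should produce a text traceback pointing at the Python line that was executing when the signal arrived.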

BenQLange avatar BenQLange commented on August 11, 2024

Here is a backtrace for the line segment error:

python: /home/bernard.lange/nocturne/nocturne/cpp/include/geometry/line_segment.h:41: nocturne::geometry::Vector2D nocturne::geometry::LineSegment::Point(float) const: Assertion `t >= 0.0f && t <= 1.0f' failed.

Thread 1 "python" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6)
    at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ffff705e7f1 in __GI_abort () at abort.c:79
#2  0x00007ffff704e3fa in __assert_fail_base (fmt=0x7ffff71d56c0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
    assertion=assertion@entry=0x7fff4a2cf33a "t >= 0.0f && t <= 1.0f", 
    file=file@entry=0x7fff4a2cf2f0 "/home/bernard.lange/nocturne/nocturne/cpp/include/geometry/line_segment.h", 
    line=line@entry=41, 
    function=function@entry=0x7fff4a2cf360 <nocturne::geometry::LineSegment::Point(float) const::__PRETTY_FUNCTION__> "nocturne::geometry::Vector2D nocturne::geometry::LineSegment::Point(float) const") at assert.c:92
#3  0x00007ffff704e472 in __GI___assert_fail (assertion=assertion@entry=0x7fff4a2cf33a "t >= 0.0f && t <= 1.0f", 
    file=file@entry=0x7fff4a2cf2f0 "/home/bernard.lange/nocturne/nocturne/cpp/include/geometry/line_segment.h", 
    line=line@entry=41, 
    function=function@entry=0x7fff4a2cf360 <nocturne::geometry::LineSegment::Point(float) const::__PRETTY_FUNCTION__> "nocturne::geometry::Vector2D nocturne::geometry::LineSegment::Point(float) const") at assert.c:101
#4  0x00007fff4a2b10b4 in nocturne::geometry::LineSegment::Point (t=<optimized out>, this=<optimized out>)
    at /home/bernard.lange/nocturne/nocturne/cpp/include/geometry/line_segment.h:41
#5  nocturne::(anonymous namespace)::VisibleObjectsImpl (objects=std::vector of length 16, capacity 32 = {...}, o=..., 
    points=std::vector of length 72, capacity 128 = {...}) at /home/bernard.lange/nocturne/nocturne/cpp/src/view_field.cc:84
#6  0x00007fff4a2b2507 in nocturne::ViewField::FilterVisibleObjects (this=this@entry=0x7fffffffcce0, 
    objects=std::vector of length 16, capacity 32 = {...}) at /home/bernard.lange/nocturne/nocturne/cpp/src/view_field.cc:156
#7  0x00007fff4a286541 in nocturne::Scenario::VisibleObjects (this=this@entry=0x55555d6cadc0, src=..., 
    view_dist=view_dist@entry=80, view_angle=view_angle@entry=2.09439516, head_angle=head_angle@entry=0)
    at /home/bernard.lange/nocturne/nocturne/cpp/src/scenario.cc:362
#8  0x00007fff4a28852e in nocturne::Scenario::FlattenedVisibleState (this=0x55555d6cadc0, src=..., 
    view_dist=view_dist@entry=80, view_angle=view_angle@entry=2.09439516, head_angle=head_angle@entry=0)
    at /home/bernard.lange/nocturne/nocturne/cpp/src/scenario.cc:508
#9  0x00007fff4a267742 in nocturne::<lambda(const nocturne::Scenario&, const nocturne::Object&, float, float, float)>::operator() (__closure=<optimized out>, head_angle=0, view_angle=2.09439516, view_dist=80, src=..., scenario=...)
    at /home/bernard.lange/nocturne/nocturne/pybind11/src/scenario.cc:73
#10 pybind11::detail::argument_loader<nocturne::Scenario const&, nocturne::Object const&, float, float, float>::call_impl<pybind11::array_t<float, 16>, nocturne::DefineScenario(pybind11::module&)::<lambda(const nocturne::Scenario&, const nocturne::Object&, float, float, float)>&, 0, 1, 2, 3, 4, pybind11::detail::void_type> (f=..., this=0x7fffffffd030)
    at /home/bernard.lange/nocturne/third_party/pybind11/include/pybind11/cast.h:1418
#11 pybind11::detail::argument_loader<nocturne::Scenario const&, nocturne::Object const&, float, float, float>::call<pybind11::array_t<float, 16>, pybind11::detail::void_type, nocturne::DefineScenario(pybind11::module&)::<lambda(const nocturne::Scenario&, const nocturne::Object&, float, float, float)>&> (f=..., this=<optimized out>)
    at /home/bernard.lange/nocturne/third_party/pybind11/include/pybind11/cast.h:1387
#12 pybind11::cpp_function::<lambda(pybind11::detail::function_call&)>::operator() (__closure=0x0, call=...)
    at /home/bernard.lange/nocturne/third_party/pybind11/include/pybind11/pybind11.h:249
#13 pybind11::cpp_function::<lambda(pybind11::detail::function_call&)>::_FUN(pybind11::detail::function_call &) ()
    at /home/bernard.lange/nocturne/third_party/pybind11/include/pybind11/pybind11.h:224
#14 0x00007fff4a243e49 in pybind11::cpp_function::dispatcher (self=<optimized out>, args_in=0x7fff43096ec0, 
    kwargs_in=0x7fff4a503280) at /home/bernard.lange/nocturne/third_party/pybind11/include/pybind11/pybind11.h:924
#15 0x000055555569000e in cfunction_call_varargs () at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/call.c:743
#16 0x000055555568513f in _PyObject_MakeTpCall () at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/call.c:159
#17 0x00005555556baca0 in _PyObject_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, args=0x55555856ca40, 
    callable=0x7fff4a688f40) at /opt/conda/conda-bld/python-split_1648465063888/work/Include/cpython/abstract.h:125
#18 method_vectorcall () at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/classobject.c:60
#19 0x000055555572beb0 in _PyObject_Vectorcall (kwnames=0x7ffff6e66280, nargsf=<optimized out>, args=<optimized out>, 
    callable=0x7fff53c96580) at /opt/conda/conda-bld/python-split_1648465063888/work/Include/cpython/abstract.h:127
#20 call_function (kwnames=0x7ffff6e66280, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=<optimized out>)
    at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:4963
#21 _PyEval_EvalFrameDefault () at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:3515
#22 0x00005555557210ff in PyEval_EvalFrameEx (throwflag=0, f=0x55555856c7a0)
    at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:741
#23 _PyEval_EvalCodeWithName () at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:4298
#24 0x0000555555721bc4 in _PyFunction_Vectorcall () at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/call.c:436
#25 0x000055555572b0bb in _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7ffff6f6a5b8, 
    callable=0x7ffff6fcf310) at /opt/conda/conda-bld/python-split_1648465063888/work/Include/cpython/abstract.h:127
#26 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x5555558f3ff0)
    at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:4963
#27 _PyEval_EvalFrameDefault () at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:3500
#28 0x0000555555720600 in PyEval_EvalFrameEx (throwflag=0, f=0x7ffff6f6a440)
    at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:741
#29 _PyEval_EvalCodeWithName () at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:4298
#30 0x0000555555721eb3 in PyEval_EvalCodeEx (closure=0x0, kwdefs=0x0, defcount=0, defs=0x0, kwcount=0, kws=0x0, argcount=0, 

And for the polygon error:

python: /home/bernard.lange/nocturne/nocturne/cpp/include/geometry/polygon.h:67: nocturne::geometry::ConvexPolygon::ConvexPolygon(const std::initializer_list<nocturne::geometry::Vector2D>&): Assertion `VerifyVerticesOrder()' failed.

Thread 1 "python" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6)
    at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ffff705e7f1 in __GI_abort () at abort.c:79
#2  0x00007ffff704e3fa in __assert_fail_base (
    fmt=0x7ffff71d56c0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
    assertion=assertion@entry=0x7fff4a2c88ad "VerifyVerticesOrder()", 
    file=file@entry=0x7fff4a2c8868 "/home/bernard.lange/nocturne/nocturne/cpp/include/geometry/polygon.h", line=line@entry=67, 
    function=function@entry=0x7fff4a2c88e0 <nocturne::geometry::ConvexPolygon::ConvexPolygon(std::initializer_list<nocturne::geometry::Vector2D> const&)::__PRETTY_FUNCTION__> "nocturne::geometry::ConvexPolygon::ConvexPolygon(const std::initializer_list<nocturne::geometry::Vector2D>&)")
    at assert.c:92
#3  0x00007ffff704e472 in __GI___assert_fail (
    assertion=assertion@entry=0x7fff4a2c88ad "VerifyVerticesOrder()", 
    file=file@entry=0x7fff4a2c8868 "/home/bernard.lange/nocturne/nocturne/cpp/include/geometry/polygon.h", line=line@entry=67, 
    function=function@entry=0x7fff4a2c88e0 <nocturne::geometry::ConvexPolygon::ConvexPolygon(std::initializer_list<nocturne::geometry::Vector2D> const&)::__PRETTY_FUNCTION__> "nocturne::geometry::ConvexPolygon::ConvexPolygon(const std::initializer_list<nocturne::geometry::Vector2D>&)")
    at assert.c:101
#4  0x00007fff4a280052 in nocturne::geometry::ConvexPolygon::ConvexPolygon (vertices=..., this=0x7fffffffc370)
    at /home/bernard.lange/nocturne/nocturne/cpp/include/geometry/polygon.h:67
#5  nocturne::Object::BoundingPolygon (this=<optimized out>)
    at /home/bernard.lange/nocturne/nocturne/cpp/src/object.cc:27
#6  0x00007fff4a2a810e in nocturne::ObjectBase::GetAABB (
    this=<optimized out>)
    at /home/bernard.lange/nocturne/nocturne/cpp/include/object_base.h:66
#7  void nocturne::geometry::BVH::Reset<nocturne::Object>(std::vector<std::shared_ptr<nocturne::Object>, std::allocator<std::shared_ptr<nocturne::Object> > > const&)::{lambda(std::shared_ptr<nocturne::Object> const&)#1}::operator()(std::shared_ptr<nocturne::Object> const&) const (obj=..., 
    __closure=<synthetic pointer>)
    at /home/bernard.lange/nocturne/nocturne/cpp/include/geometry/bvh.h:107
#8  nocturne::geometry::BVH::ResetImpl<std::shared_ptr<nocturne::Object>, void nocturne::geometry::BVH::Reset<nocturne::Object>(std::vector<std::shared_ptr<nocturne::Object>, std::allocator<std::shared_ptr<nocturne::Object> > > const&)::{lambda(std::shared_ptr<nocturne::Object> const&)#1}, void nocturne::geometry::BVH::Reset<nocturne::Object>(std::vector<std::shared_ptr<nocturne::Object>, std::allocator<std::shared_ptr<nocturne::Object> > > const&)::{lambda(std::shared_ptr<nocturne::Object> const&)#2}>(std::vector<std::shared_ptr<nocturne::Object>, std::allocator<std::shared_ptr<nocturne::Object> > > const&, void nocturne::geometry::BVH::Reset<nocturne::Object>(std::vector<std::shared_ptr<nocturne::Object>, std::allocator<std::shared_ptr<nocturne::Object> > > const&)::{lambda(std::shared_ptr<nocturne::Object> const&)#1}, void nocturne::geometry::BVH::Reset<nocturne::Object>(std::vector<std::shared_ptr<nocturne::Object>, std::allocator<std::shared_ptr<nocturne::Object> > > const&)::{lambda(std::shared_ptr<nocturne::Object> const&)#2}) (this=this@entry=0x55555a066ed8, 
    objects=std::vector of length 72, capacity 128 = {...}, 
    aabb_func=..., ptr_func=...)
    at /home/bernard.lange/nocturne/nocturne/cpp/include/geometry/bvh.h:163
#9  0x00007fff4a28efe9 in nocturne::geometry::BVH::Reset<nocturne::Object> (objects=std::vector of length 72, capacity 128 = {...}, 
    this=0x55555a066ed8)
    at /home/bernard.lange/nocturne/nocturne/cpp/include/geometry/bvh.h:105
#10 nocturne::Scenario::LoadObjects (this=this@entry=0x55555a066d00, 
    objects_json=...)
    at /home/bernard.lange/nocturne/nocturne/cpp/src/scenario.cc:1127
#11 0x00007fff4a2919ad in nocturne::Scenario::LoadScenario (
    this=this@entry=0x55555a066d00, 
    scenario_path="/home/bernard.lange/nocturne_dataset/bernard_nocturne_dataset/val/tfrecord-00506-of-01000_353.json")
    at /home/bernard.lange/nocturne/nocturne/cpp/src/scenario.cc:227
#12 0x00007fff4a26a264 in nocturne::Scenario::Scenario (
    this=0x55555a066d00, 
    scenario_path="/home/bernard.lange/nocturne_dataset/bernard_nocturne_dataset/val/tfrecord-00506-of-01000_353.json", config=std::unordered_map with 6 elements = {...})
    at /home/bernard.lange/nocturne/nocturne/cpp/include/scenario.h:100
#13 0x00007fff4a2766a0 in std::make_unique<nocturne::Scenario, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::variant<bool, long, float>, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::variant<bool, long, float> > > > const&> () at /usr/include/c++/7/bits/unique_ptr.h:821
#14 nocturne::Simulation::Simulation (config=std::unordered_map with 6 elements = {...}, 
    scenario_path="/home/bernard.lange/nocturne_dataset/bernard_nocturne_dataset/val/tfrecord-00506-of-01000_353.json", 
    this=0x55555c020f20) at /home/bernard.lange/nocturne/nocturne/cpp/include/simulation.h:32
#15 pybind11::detail::initimpl::construct_or_initialize<nocturne::Simulation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::variant<bool, long, float>, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::variant<bool, long, float> > > > const&, 0> ()
    at /home/bernard.lange/nocturne/third_party/pybind11/include/pybind11/detail/init.h:73

BenQLange avatar BenQLange commented on August 11, 2024

So floating point exceptions are not deterministic, but assertion errors are. I have identified invalid files in the training set:

array(['tfrecord-00008-of-01000_364.json',
       'tfrecord-00104-of-01000_303.json',
       'tfrecord-00128-of-01000_365.json',
       'tfrecord-00131-of-01000_86.json',
       'tfrecord-00214-of-01000_146.json',
       'tfrecord-00402-of-01000_57.json',
       'tfrecord-00506-of-01000_353.json',
       'tfrecord-00689-of-01000_184.json',
       'tfrecord-00811-of-01000_413.json',
       'tfrecord-00074-of-01000_192.json',
       'tfrecord-00090-of-01000_237.json',
       'tfrecord-00151-of-01000_418.json',
       'tfrecord-00179-of-01000_445.json',
       'tfrecord-00203-of-01000_466.json',
       'tfrecord-00206-of-01000_87.json',
       'tfrecord-00241-of-01000_464.json',
       'tfrecord-00247-of-01000_214.json',
       'tfrecord-00279-of-01000_81.json',
       'tfrecord-00298-of-01000_75.json',
       'tfrecord-00325-of-01000_483.json',
       'tfrecord-00343-of-01000_188.json',
       'tfrecord-00376-of-01000_41.json',
       'tfrecord-00396-of-01000_203.json',
       'tfrecord-00411-of-01000_295.json',
       'tfrecord-00431-of-01000_130.json',
       'tfrecord-00472-of-01000_85.json',
       'tfrecord-00483-of-01000_62.json',
       'tfrecord-00487-of-01000_377.json',
       'tfrecord-00532-of-01000_444.json',
       'tfrecord-00534-of-01000_37.json',
       'tfrecord-00564-of-01000_247.json',
       'tfrecord-00567-of-01000_34.json',
       'tfrecord-00570-of-01000_361.json',
       'tfrecord-00580-of-01000_420.json',
       'tfrecord-00616-of-01000_211.json',
       'tfrecord-00639-of-01000_188.json',
       'tfrecord-00653-of-01000_394.json',
       'tfrecord-00711-of-01000_490.json',
       'tfrecord-00735-of-01000_12.json',
       'tfrecord-00738-of-01000_388.json',
       'tfrecord-00754-of-01000_415.json',
       'tfrecord-00802-of-01000_74.json',
       'tfrecord-00805-of-01000_368.json',
       'tfrecord-00810-of-01000_6.json',
       'tfrecord-00829-of-01000_456.json',
       'tfrecord-00846-of-01000_330.json',
       'tfrecord-00863-of-01000_432.json',
       'tfrecord-00868-of-01000_297.json',
       'tfrecord-00869-of-01000_43.json',
       'tfrecord-00924-of-01000_471.json',
       'tfrecord-00937-of-01000_364.json',
       'tfrecord-00962-of-01000_378.json',
       'tfrecord-00984-of-01000_128.json'], dtype='<U32')

Hopefully, that's the reason behind floating point exception errors. I'll let you know after I run some more experiments.
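
For reference, skipping the listed files can be as simple as the sketch below. It assumes the dataloader works from a list of file paths, and the helper name is made up for the example.

```python
from pathlib import Path


def drop_failing(paths, failing_names):
    """Return paths whose file name is not in the collection of failing names."""
    failing = set(failing_names)
    return [p for p in paths if Path(p).name not in failing]
```

Matching on the file name rather than the full path keeps the skip list usable across machines with different dataset directories.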

UPDATE: There are more failing files. I didn't iterate over time :(

BenQLange avatar BenQLange commented on August 11, 2024

Small update: here are the configs I used to find the failing scenes listed above. Depending on the config values, I get more or fewer assertion errors. In particular, I noticed this when changing the view angle. Hope that helps.

    # load dataloader config
    dataloader_config = {
        'tmin': 0,
        'tmax': 90,
        'view_dist': 80,
        'view_angle': np.radians(120),
        'dt': 0.1,
        'expert_action_bounds': None,
        'expert_position': True,
        'state_normalization': 100,
        'n_stacked_states': 5,
        'perturbations': False,
    }

    scenario_config = {
        'start_time': 0,
        'allow_non_vehicles': True,
        'spawn_invalid_objects': True,
        'max_visible_road_points': 500,
        'sample_every_n': 1,
        'road_edge_first': False,
    }

    tmin = dataloader_config.get('tmin', 0)
    tmax = dataloader_config.get('tmax', 90)
    view_dist = dataloader_config.get('view_dist', 80)
    view_angle = dataloader_config.get('view_angle', np.radians(120))
    dt = dataloader_config.get('dt', 0.1)
    expert_action_bounds = dataloader_config.get('expert_action_bounds',
                                                 [[-6, 6], [-0.7, 0.7]])
    expert_position = dataloader_config.get('expert_position', True)
    state_normalization = dataloader_config.get('state_normalization', 100)
    n_stacked_states = dataloader_config.get('n_stacked_states', 5)
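
Since a failed assertion or a floating point exception kills the Python process and cannot be caught with try/except, one way to build the list of failing scenes is to load each scene in its own child process and look at how the child exited. The sketch below is only an illustration of that idea on POSIX systems: `classify_exit`, `run_isolated` and `find_failing` are hypothetical helpers, and `child_code` stands in for whatever snippet loads a single scene with the configs above (the thread does not show how to restrict the dataset to one file, so that part is left out).

```python
import signal
import subprocess
import sys


def classify_exit(returncode):
    """Turn a child's return code into a short label."""
    if returncode == 0:
        return 'ok'
    if returncode < 0:
        # On POSIX, subprocess reports death by signal N as return code -N.
        try:
            return signal.Signals(-returncode).name  # e.g. 'SIGFPE'
        except ValueError:
            return f'signal {-returncode}'
    return f'exit {returncode}'


def run_isolated(child_code, scene_file):
    """Run `child_code` in a fresh interpreter with the scene as sys.argv[1]."""
    proc = subprocess.run([sys.executable, '-c', child_code, scene_file],
                          capture_output=True)
    return classify_exit(proc.returncode)


def find_failing(scene_files, child_code):
    """Return {scene_file: label} for every scene whose child did not exit cleanly."""
    failing = {}
    for scene_file in scene_files:
        label = run_isolated(child_code, scene_file)
        if label != 'ok':
            failing[scene_file] = label
    return failing


# Stand-in child: exits with code 3 for anything but 'good.json'.
demo_child = "import sys; sys.exit(0 if sys.argv[1] == 'good.json' else 3)"
demo_result = find_failing(['good.json', 'bad.json'], demo_child)
```

Because every scene runs in a fresh interpreter, a crash in one scene does not stop the loop, and the signal name gives a first hint at whether it was an assertion (abort) or a floating point exception.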

from nocturne.

eugenevinitsky avatar eugenevinitsky commented on August 11, 2024

Are the errors on the files you listed deterministic? I've constructed the subset of files that you have and looped the dataloader over them but am not seeing an error yet.

Reproduction script for reference

# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
"""Reproduction script: iterate the imitation learning dataloader without training."""
from datetime import datetime
from pathlib import Path
import random

import hydra
import numpy as np
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm

from examples.imitation_learning.waymo_data_loader import WaymoDataset


def set_seed_everywhere(seed):
    """Ensure determinism."""
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)


@hydra.main(config_path="../../cfgs/imitation", config_name="config")
def main(args):
    """Train an IL model."""
    set_seed_everywhere(args.seed)
    expert_bounds = [[-6, 6], [-0.7, 0.7]]
        
    # load dataloader config
    dataloader_config = {
        'tmin': 0,
        'tmax': 90,
        'view_dist': 80,
        'view_angle': np.radians(120),
        'dt': 0.1,
        'expert_action_bounds': expert_bounds,
        'expert_position': False,
        'state_normalization': 100,
        'n_stacked_states': 5,
        'perturbations': False,
    }

    scenario_config = {
        'start_time': 0,
        'allow_non_vehicles': True,
        'spawn_invalid_objects': True,
        'max_visible_road_points': 500,
        'sample_every_n': 1,
        'road_edge_first': False,
    }
    
    dataset = WaymoDataset(
        data_path=args.path,
        file_limit=args.num_files,
        dataloader_config=dataloader_config,
        scenario_config=scenario_config,
    )
    data_loader = iter(
        DataLoader(
            dataset,
            batch_size=args.batch_size,
            num_workers=args.n_cpus,
            pin_memory=True,
        ))

    # create exp dir
    time_str = datetime.now().strftime('%Y_%m_%d_%H_%M_%S')
    exp_dir = Path.cwd() / Path('train_logs') / time_str
    exp_dir.mkdir(parents=True, exist_ok=True)

    # train loop
    for epoch in range(args.epochs):
        print(f'\nepoch {epoch+1}/{args.epochs}')
        n_samples = epoch * args.batch_size * (args.samples_per_epoch //
                                               args.batch_size)

        for i in tqdm(range(args.samples_per_epoch // args.batch_size),
                      unit='batch'):
            # get states and expert actions
            states, expert_actions = next(data_loader)


if __name__ == '__main__':
    main()

from nocturne.

BenQLange avatar BenQLange commented on August 11, 2024

I see. Yes, the assertion errors are deterministic, but they only show up when nocturne is compiled with the debug flag on. The floating point exceptions are not deterministic, and I don't have a clear idea where they are coming from. I'll run your script on my machine later and let you know the outcome.

EDIT: Got delayed. I'll run it today.

from nocturne.

BenQLange avatar BenQLange commented on August 11, 2024

Thanks!

from nocturne.
