Comments (30)
I don't think a try/except block can catch floating point exceptions or assertion errors. I tried, and the error still killed the worker and stopped the script.
Instead, I iterated through the dataset with the above configs and built a dictionary of failing files (a bash script that reran the loop until it had gone through the whole dataset). For now, I just skip those files during training.
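For anyone who wants to repeat this without a bash loop, here is a minimal sketch of the same idea in Python. It assumes each file can be loaded by its own command; `command_for` is a hypothetical callback that returns that command, not part of Nocturne. The check on negative return codes only works on POSIX systems.

```python
import signal
import subprocess


def find_crashing_files(paths, command_for):
    """Run one child process per file and record the ones killed by a signal.

    `command_for(path)` is a caller-supplied (hypothetical) function that
    returns the argv list used to load a single file.
    Returns a dict mapping each crashing path to the signal name.
    """
    failures = {}
    for path in paths:
        result = subprocess.run(command_for(path), capture_output=True)
        # On POSIX, a return code of -N means the child was killed by signal N,
        # which is what SIGABRT (failed assert) and SIGFPE look like from outside.
        if result.returncode < 0:
            failures[path] = signal.Signals(-result.returncode).name
    return failures
```

Because every file is loaded in its own process, a crash only kills that child and the loop carries on with the next file. A call could look like `find_crashing_files(files, lambda p: [sys.executable, "check_one_file.py", p])`, where `check_one_file.py` is a hypothetical script that builds the samples for one scenario.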
from nocturne.
Oh that's super useful that you can reproduce it without the training! So it's in the worker or possibly in Nocturne itself...
I'll try to reproduce it using the smaller dataset but otherwise it'll be a few days until my new laptop arrives and I can do some analysis on the full dataset, sorry!
from nocturne.
I have enabled the debug option in setup.py. Now I am getting the following errors:
python: /home/bernard.lange/nocturne/nocturne/cpp/include/geometry/line_segment.h:41: nocturne::geometry::Vector2D nocturne::geometry::LineSegment::Point(float) const: Assertion `t >= 0.0f && t <= 1.0f' failed.
Thread 1 "python" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
python: /home/bernard.lange/nocturne/nocturne/cpp/include/geometry/polygon.h:67: nocturne::geometry::ConvexPolygon::ConvexPolygon(const std::initializer_list<nocturne::geometry::Vector2D>&): Assertion `VerifyVerticesOrder()' failed.
Thread 1 "python" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6)
at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
I am not a C++ wizard. Is it possible that those assertion errors lead to the floating point exception?
from nocturne.
I think that's probably it; great job and thank you!! @xiaomengy (our C++ wizard) do you see how this error could occur? We could definitely use your insight here
from nocturne.
Hey @BenQLange, just to give you an update we're slightly backlogged but Xiaomeng will take a look at this on Tuesday. Figured it was better to have a time than persistent uncertainty
from nocturne.
Hi @BenQLange. Sorry for the delay; I had some other deadlines. I will take a detailed look later today and hopefully resolve it ASAP.
from nocturne.
Thanks for finding those! We are still looking into it, but in the meantime, would wrapping the call in a try/except block temporarily resolve the issue so that you aren't blocked? We should have a resolution soon.
from nocturne.
The modified dataset resolves the assertion errors, but I am still getting floating point exceptions from time to time :(
from nocturne.
Hmm, we are still looking into it. I just got a new laptop with enough space for the whole dataset so hopefully I can reconstruct your errors and help.
from nocturne.
Oh! Okay, let me throw on the debug flag and try again. Thanks for the suggestion.
from nocturne.
Hi @BenQLange. Just to let you know about our progress: there is one vehicle/object with a negative length in tfrecord-00008-of-01000_364.json, which is at least the cause of the assertion failure. We are now investigating why such values exist and will try to find a way to handle these cases.
We found an object with the shape "width": 4.4137163162231445, "length": -1.295910358428955 in tfrecord-00008-of-01000_364.json.
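A rough sketch of how other files could be checked for the same problem. It assumes only that the bad values appear as "width" and "length" keys in the same JSON object, as in the snippet above; it walks the whole document rather than relying on any particular scenario schema, and `find_bad_dimensions` is a name made up for this example.

```python
import json


def find_bad_dimensions(node, path="$"):
    """Yield (json_path, width, length) for every dict with a non-positive size."""
    if isinstance(node, dict):
        width, length = node.get("width"), node.get("length")
        both_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in (width, length)
        )
        if both_numeric and (width <= 0 or length <= 0):
            yield path, width, length
        # Keep descending: sized objects may be nested anywhere in the file.
        for key, value in node.items():
            yield from find_bad_dimensions(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_bad_dimensions(value, f"{path}[{index}]")


def check_file(json_path):
    """Load one scenario file and return the list of suspicious entries."""
    with open(json_path) as handle:
        return list(find_bad_dimensions(json.load(handle)))
```

Running `check_file` over every scenario file gives a list of objects to report upstream or to filter out before training.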
from nocturne.
We're following up with Waymo here waymo-research/waymo-open-dataset#542 and will hopefully find some resolution (though the floating point error is probably from a different source).
from nocturne.
Ooof; thank you for catching and reporting this. We have never seen this.
A few questions to see if anything in your setup is different from ours.
Question 1: are you training on the mini-dataset or the full dataset?
Question 2: Are you using all the files or a subset of them, i.e. are you modifying the value of num_files in the config?
Then, a reproducibility step:
- Is there any chance you can get the dataloader to print the scenario_path when this happens? This is a value defined in the _get_waymo_iterator here (https://github.com/facebookresearch/nocturne/blob/main/examples/imitation_learning/waymo_data_loader.py). Seeing this might help us investigate the right file and find it faster.
- Could you also print the state and action values, on the off-chance that they show whether the problem is in a state or an action?
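One possible way to get that information even when the process is killed, sketched with the standard library only. `log_before_use` and `enable_crash_tracebacks` are hypothetical helpers, not part of the Nocturne dataloader. The idea is to print each path with a flush before it is loaded, so the last line printed names the file being processed at the time of the crash.

```python
import faulthandler
import sys


def enable_crash_tracebacks():
    """Dump the Python stack when the process receives a fatal signal.

    faulthandler installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and
    SIGILL, so the Python-level call site is printed even though the crash
    happens inside the C++ extension.
    """
    faulthandler.enable()


def log_before_use(paths, stream=sys.stderr):
    """Yield each path after printing it, flushing so the line survives a crash."""
    for path in paths:
        print(f"loading scenario: {path}", file=stream, flush=True)
        yield path
```

Calling `enable_crash_tracebacks()` once at startup and wrapping the file list as `for path in log_before_use(files): ...` should be enough in a single-process run; with several dataloader workers the output from different workers may interleave.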
from nocturne.
Once I know if it's the mini or full-dataset and how many files you are using, I'll run the dataloader over the relevant files and see if we can find the file where this error occurs.
from nocturne.
Sounds good. I am using the full dataset with num_files set to -1 (entire dataset). I'll let you know when I have the file name.
from nocturne.
Thanks for that info!
@xiaomengy I'm going to write a quick script tomorrow to search through the dataset and build samples from the dataloader and return any files if they throw an error. Would you be able to run it on the cluster and send me any file-names it flags?
from nocturne.
One last question: does it ever train to completion, or is this blocking you from completing any training run? Just trying to get a sense of how rare it is.
from nocturne.
It does train to completion most often. It fails 20%ish of the time
from nocturne.
Great, that's useful information.
from nocturne.
It's weird. It's not caused by a specific file. Sometimes it iterates through all files with no issue, sometimes it crashes :(
from nocturne.
Well, it's interesting: the error is raised in the call to the distribution, so I'm wondering whether the model is producing a NaN somewhere between passing the state through the head and passing that output to the MultivariateNormal distribution, rather than this being a file error. It seems to be complaining that a value and its transpose are not close? Since the model training runs serially, you could wrap that call in a try/except block, put a breakpoint in the except branch, and see what is being passed when the method errors.
I'll try to help more but I have yet to reproduce the issue on my local machine (admittedly, training is slow). Will be faster once I get access to a cluster again.
from nocturne.
Ah! One more thing that @nathanlct pointed out, are you using Discrete actions or Continuous actions? We've only extensively tested the discrete setting, perhaps the precision / covariance matrix is acting up
from nocturne.
I don't think it's related to the call to the distribution. It happens for both the action and position action spaces. When I just iterate through the dataset in a simple script, I sometimes (but not always?) get a floating point exception. I only have screenshots of the traceback (sorry).
It's really confusing.
from nocturne.
This is the backtrace with gdb when it fails:
Thread 1 "python" received signal SIGFPE, Arithmetic exception.
0x00007fff4a29cb85 in ?? ()
from /home/bernard.lange/nocturne/nocturne_cpp.cpython-38-x86_64-linux-gnu.so
(gdb) bt
#0 0x00007fff4a29cb85 in ?? ()
from /home/bernard.lange/nocturne/nocturne_cpp.cpython-38-x86_64-linux-gnu.so
#1 0x00007fff4a2b08ab in ?? ()
from /home/bernard.lange/nocturne/nocturne_cpp.cpython-38-x86_64-linux-gnu.so
#2 0x00007fff4a2ab29a in ?? ()
from /home/bernard.lange/nocturne/nocturne_cpp.cpython-38-x86_64-linux-gnu.so
#3 0x00007fff4a2ad5dd in ?? ()
from /home/bernard.lange/nocturne/nocturne_cpp.cpython-38-x86_64-linux-gnu.so
#4 0x00007fff4a27dd2a in ?? ()
from /home/bernard.lange/nocturne/nocturne_cpp.cpython-38-x86_64-linux-gnu.so
#5 0x00007fff4a274a98 in ?? ()
from /home/bernard.lange/nocturne/nocturne_cpp.cpython-38-x86_64-linux-gnu.so
#6 0x00007fff4a269c5d in ?? ()
from /home/bernard.lange/nocturne/nocturne_cpp.cpython-38-x86_64-linux-gnu.so
#7 0x000055555569000e in cfunction_call_varargs ()
at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/call.c:743
#8 0x000055555568513f in _PyObject_MakeTpCall ()
at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/call.c:159
#9 0x00005555556bacba in _PyObject_Vectorcall (kwnames=0x0, nargsf=3, args=0x7fffffffd2d0,
callable=0x7fff4a6e83b0)
at /opt/conda/conda-bld/python-split_1648465063888/work/Include/cpython/abstract.h:125
#10 method_vectorcall ()
at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/classobject.c:89
#11 0x000055555568b20d in PyVectorcall_Call (kwargs=0x0, tuple=0x7fff4a50b840,
callable=0x7fffa512b7c0)
at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/call.c:200
#12 PyObject_Call () at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/call.c:228
#13 0x000055555562f9cb in slot_tp_init (self=0x7fff43099cf0, args=0x7fff4a50b840, kwds=0x0)
at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/typeobject.c:6793
#14 0x000055555568ff27 in type_call ()
at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/typeobject.c:994
#15 0x00007fffeec764b9 in pybind11_meta_call ()
from /home/bernard.lange/miniconda3/envs/nocturne/lib/python3.8/site-packages/torch/lib/libtorch_python.so
#16 0x000055555568513f in _PyObject_MakeTpCall ()
at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/call.c:159
#17 0x000055555572f89f in _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>,
args=0x55555856e4d8, callable=0x555558b66080)
at /opt/conda/conda-bld/python-split_1648465063888/work/Include/cpython/abstract.h:125
#18 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>,
tstate=0x5555558f3ff0)
at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:4963
#19 _PyEval_EvalFrameDefault ()
at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:3500
#20 0x00005555557210ff in PyEval_EvalFrameEx (throwflag=0, f=0x55555856e240)
at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:741
#21 _PyEval_EvalCodeWithName ()
at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:4298
#22 0x0000555555721bc4 in _PyFunction_Vectorcall ()
at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/call.c:436
#23 0x000055555572b0bb in _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>,
args=0x7ffff6f6a5b8, callable=0x7ffff6fd0310)
8/work/Include/cpython/abstract.h:127
#24 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x5555558f3ff0)
at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:4963
#25 _PyEval_EvalFrameDefault () at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:3500
#26 0x0000555555720600 in PyEval_EvalFrameEx (throwflag=0, f=0x7ffff6f6a440)
at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:741
#27 _PyEval_EvalCodeWithName () at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:4298
#28 0x0000555555721eb3 in PyEval_EvalCodeEx (closure=0x0, kwdefs=0x0, defcount=0, defs=0x0, kwcount=0, kws=0x0, argcount=0,
args=0x0, locals=<optimized out>, globals=<optimized out>, _co=<optimized out>)
at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:4327
#29 PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>)
at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:718
#30 0x0000555555796622 in run_eval_code_obj () at /opt/conda/conda-bld/python-split_1648465063888/work/Python/pythonrun.c:1166
#31 0x00005555557a71d2 in run_mod () at /opt/conda/conda-bld/python-split_1648465063888/work/Python/pythonrun.c:1188
#32 0x00005555557aa36b in pyrun_file () at /opt/conda/conda-bld/python-split_1648465063888/work/Python/pythonrun.c:1085
#33 0x00005555557aa54f in pyrun_simple_file (flags=0x7fffffffdb08, closeit=1, filename=0x7ffff6e8b4b0, fp=0x55555596b500)
at /opt/conda/conda-bld/python-split_1648465063888/work/Python/pythonrun.c:439
#34 PyRun_SimpleFileExFlags () at /opt/conda/conda-bld/python-split_1648465063888/work/Python/pythonrun.c:472
#35 0x00005555557aaa29 in pymain_run_file (cf=0x7fffffffdb08, config=0x5555558f3020)
at /opt/conda/conda-bld/python-split_1648465063888/work/Modules/main.c:391
#36 pymain_run_python (exitcode=0x7fffffffdb00) at /opt/conda/conda-bld/python-split_1648465063888/work/Modules/main.c:616
#37 Py_RunMain () at /opt/conda/conda-bld/python-split_1648465063888/work/Modules/main.c:695
#38 0x00005555557aac29 in Py_BytesMain () at /opt/conda/conda-bld/python-split_1648465063888/work/Modules/main.c:1127
#39 0x00007ffff703fc87 in __libc_start_main (main=0x55555565bea0 <main>, argc=2, argv=0x7fffffffdcf8, init=<optimized out>,
fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffdce8) at ../csu/libc-start.c:310
#40 0x000055555574dad7 in _start ()
Does it tell you anything about the root cause?
from nocturne.
Here is a backtrace for the line segment error:
python: /home/bernard.lange/nocturne/nocturne/cpp/include/geometry/line_segment.h:41: nocturne::geometry::Vector2D nocturne::geometry::LineSegment::Point(float) const: Assertion `t >= 0.0f && t <= 1.0f' failed.
Thread 1 "python" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6)
at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007ffff705e7f1 in __GI_abort () at abort.c:79
#2 0x00007ffff704e3fa in __assert_fail_base (fmt=0x7ffff71d56c0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
assertion=assertion@entry=0x7fff4a2cf33a "t >= 0.0f && t <= 1.0f",
file=file@entry=0x7fff4a2cf2f0 "/home/bernard.lange/nocturne/nocturne/cpp/include/geometry/line_segment.h",
line=line@entry=41,
function=function@entry=0x7fff4a2cf360 <nocturne::geometry::LineSegment::Point(float) const::__PRETTY_FUNCTION__> "nocturne::geometry::Vector2D nocturne::geometry::LineSegment::Point(float) const") at assert.c:92
#3 0x00007ffff704e472 in __GI___assert_fail (assertion=assertion@entry=0x7fff4a2cf33a "t >= 0.0f && t <= 1.0f",
file=file@entry=0x7fff4a2cf2f0 "/home/bernard.lange/nocturne/nocturne/cpp/include/geometry/line_segment.h",
line=line@entry=41,
function=function@entry=0x7fff4a2cf360 <nocturne::geometry::LineSegment::Point(float) const::__PRETTY_FUNCTION__> "nocturne::geometry::Vector2D nocturne::geometry::LineSegment::Point(float) const") at assert.c:101
#4 0x00007fff4a2b10b4 in nocturne::geometry::LineSegment::Point (t=<optimized out>, this=<optimized out>)
at /home/bernard.lange/nocturne/nocturne/cpp/include/geometry/line_segment.h:41
#5 nocturne::(anonymous namespace)::VisibleObjectsImpl (objects=std::vector of length 16, capacity 32 = {...}, o=...,
points=std::vector of length 72, capacity 128 = {...}) at /home/bernard.lange/nocturne/nocturne/cpp/src/view_field.cc:84
#6 0x00007fff4a2b2507 in nocturne::ViewField::FilterVisibleObjects (this=this@entry=0x7fffffffcce0,
objects=std::vector of length 16, capacity 32 = {...}) at /home/bernard.lange/nocturne/nocturne/cpp/src/view_field.cc:156
#7 0x00007fff4a286541 in nocturne::Scenario::VisibleObjects (this=this@entry=0x55555d6cadc0, src=...,
view_dist=view_dist@entry=80, view_angle=view_angle@entry=2.09439516, head_angle=head_angle@entry=0)
at /home/bernard.lange/nocturne/nocturne/cpp/src/scenario.cc:362
#8 0x00007fff4a28852e in nocturne::Scenario::FlattenedVisibleState (this=0x55555d6cadc0, src=...,
view_dist=view_dist@entry=80, view_angle=view_angle@entry=2.09439516, head_angle=head_angle@entry=0)
at /home/bernard.lange/nocturne/nocturne/cpp/src/scenario.cc:508
#9 0x00007fff4a267742 in nocturne::<lambda(const nocturne::Scenario&, const nocturne::Object&, float, float, float)>::operator() (__closure=<optimized out>, head_angle=0, view_angle=2.09439516, view_dist=80, src=..., scenario=...)
at /home/bernard.lange/nocturne/nocturne/pybind11/src/scenario.cc:73
#10 pybind11::detail::argument_loader<nocturne::Scenario const&, nocturne::Object const&, float, float, float>::call_impl<pybind11::array_t<float, 16>, nocturne::DefineScenario(pybind11::module&)::<lambda(const nocturne::Scenario&, const nocturne::Object&, float, float, float)>&, 0, 1, 2, 3, 4, pybind11::detail::void_type> (f=..., this=0x7fffffffd030)
at /home/bernard.lange/nocturne/third_party/pybind11/include/pybind11/cast.h:1418
#11 pybind11::detail::argument_loader<nocturne::Scenario const&, nocturne::Object const&, float, float, float>::call<pybind11::array_t<float, 16>, pybind11::detail::void_type, nocturne::DefineScenario(pybind11::module&)::<lambda(const nocturne::Scenario&, const nocturne::Object&, float, float, float)>&> (f=..., this=<optimized out>)
at /home/bernard.lange/nocturne/third_party/pybind11/include/pybind11/cast.h:1387
#12 pybind11::cpp_function::<lambda(pybind11::detail::function_call&)>::operator() (__closure=0x0, call=...)
at /home/bernard.lange/nocturne/third_party/pybind11/include/pybind11/pybind11.h:249
#13 pybind11::cpp_function::<lambda(pybind11::detail::function_call&)>::_FUN(pybind11::detail::function_call &) ()
at /home/bernard.lange/nocturne/third_party/pybind11/include/pybind11/pybind11.h:224
#14 0x00007fff4a243e49 in pybind11::cpp_function::dispatcher (self=<optimized out>, args_in=0x7fff43096ec0,
kwargs_in=0x7fff4a503280) at /home/bernard.lange/nocturne/third_party/pybind11/include/pybind11/pybind11.h:924
#15 0x000055555569000e in cfunction_call_varargs () at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/call.c:743
#16 0x000055555568513f in _PyObject_MakeTpCall () at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/call.c:159
#17 0x00005555556baca0 in _PyObject_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, args=0x55555856ca40,
callable=0x7fff4a688f40) at /opt/conda/conda-bld/python-split_1648465063888/work/Include/cpython/abstract.h:125
#18 method_vectorcall () at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/classobject.c:60
#19 0x000055555572beb0 in _PyObject_Vectorcall (kwnames=0x7ffff6e66280, nargsf=<optimized out>, args=<optimized out>,
callable=0x7fff53c96580) at /opt/conda/conda-bld/python-split_1648465063888/work/Include/cpython/abstract.h:127
#20 call_function (kwnames=0x7ffff6e66280, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=<optimized out>)
at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:4963
#21 _PyEval_EvalFrameDefault () at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:3515
#22 0x00005555557210ff in PyEval_EvalFrameEx (throwflag=0, f=0x55555856c7a0)
at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:741
#23 _PyEval_EvalCodeWithName () at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:4298
#24 0x0000555555721bc4 in _PyFunction_Vectorcall () at /opt/conda/conda-bld/python-split_1648465063888/work/Objects/call.c:436
#25 0x000055555572b0bb in _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7ffff6f6a5b8,
callable=0x7ffff6fcf310) at /opt/conda/conda-bld/python-split_1648465063888/work/Include/cpython/abstract.h:127
#26 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x5555558f3ff0)
at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:4963
#27 _PyEval_EvalFrameDefault () at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:3500
#28 0x0000555555720600 in PyEval_EvalFrameEx (throwflag=0, f=0x7ffff6f6a440)
at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:741
#29 _PyEval_EvalCodeWithName () at /opt/conda/conda-bld/python-split_1648465063888/work/Python/ceval.c:4298
#30 0x0000555555721eb3 in PyEval_EvalCodeEx (closure=0x0, kwdefs=0x0, defcount=0, defs=0x0, kwcount=0, kws=0x0, argcount=0,
And for the polygon error:
python: /home/bernard.lange/nocturne/nocturne/cpp/include/geometry/polygon.h:67: nocturne::geometry::ConvexPolygon::ConvexPolygon(const std::initializer_list<nocturne::geometry::Vector2D>&): Assertion `VerifyVerticesOrder()' failed.
Thread 1 "python" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6)
at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007ffff705e7f1 in __GI_abort () at abort.c:79
#2 0x00007ffff704e3fa in __assert_fail_base (
fmt=0x7ffff71d56c0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
assertion=assertion@entry=0x7fff4a2c88ad "VerifyVerticesOrder()",
file=file@entry=0x7fff4a2c8868 "/home/bernard.lange/nocturne/nocturne/cpp/include/geometry/polygon.h", line=line@entry=67,
function=function@entry=0x7fff4a2c88e0 <nocturne::geometry::ConvexPolygon::ConvexPolygon(std::initializer_list<nocturne::geometry::Vector2D> const&)::__PRETTY_FUNCTION__> "nocturne::geometry::ConvexPolygon::ConvexPolygon(const std::initializer_list<nocturne::geometry::Vector2D>&)")
at assert.c:92
#3 0x00007ffff704e472 in __GI___assert_fail (
assertion=assertion@entry=0x7fff4a2c88ad "VerifyVerticesOrder()",
file=file@entry=0x7fff4a2c8868 "/home/bernard.lange/nocturne/nocturne/cpp/include/geometry/polygon.h", line=line@entry=67,
function=function@entry=0x7fff4a2c88e0 <nocturne::geometry::ConvexPolygon::ConvexPolygon(std::initializer_list<nocturne::geometry::Vector2D> const&)::__PRETTY_FUNCTION__> "nocturne::geometry::ConvexPolygon::ConvexPolygon(const std::initializer_list<nocturne::geometry::Vector2D>&)")
at assert.c:101
#4 0x00007fff4a280052 in nocturne::geometry::ConvexPolygon::ConvexPolygon (vertices=..., this=0x7fffffffc370)
at /home/bernard.lange/nocturne/nocturne/cpp/include/geometry/polygon.h:67
#5 nocturne::Object::BoundingPolygon (this=<optimized out>)
at /home/bernard.lange/nocturne/nocturne/cpp/src/object.cc:27
#6 0x00007fff4a2a810e in nocturne::ObjectBase::GetAABB (
this=<optimized out>)
at /home/bernard.lange/nocturne/nocturne/cpp/include/object_base.h:66
#7 void nocturne::geometry::BVH::Reset<nocturne::Object>(std::vector<std::shared_ptr<nocturne::Object>, std::allocator<std::shared_ptr<nocturne::Object> > > const&)::{lambda(std::shared_ptr<nocturne::Object> const&)#1}::operator()(std::shared_ptr<nocturne::Object> const&) const (obj=...,
__closure=<synthetic pointer>)
at /home/bernard.lange/nocturne/nocturne/cpp/include/geometry/bvh.h:107
#8 nocturne::geometry::BVH::ResetImpl<std::shared_ptr<nocturne::Object>, void nocturne::geometry::BVH::Reset<nocturne::Object>(std::vector<std::shared_ptr<nocturne::Object>, std::allocator<std::shared_ptr<nocturne::Object> > > const&)::{lambda(std::shared_ptr<nocturne::Object> const&)#1}, void nocturne::geometry::BVH::Reset<nocturne::Object>(std::vector<std::shared_ptr<nocturne::Object>, std::allocator<std::shared_ptr<nocturne::Object> > > const&)::{lambda(std::shared_ptr<nocturne::Object> const&)#2}>(std::vector<std::shared_ptr<nocturne::Object>, std::allocator<std::shared_ptr<nocturne::Object> > > const&, void nocturne::geometry::BVH::Reset<nocturne::Object>(std::vector<std::shared_ptr<nocturne::Object>, std::allocator<std::shared_ptr<nocturne::Object> > > const&)::{lambda(std::shared_ptr<nocturne::Object> const&)#1}, void nocturne::geometry::BVH::Reset<nocturne::Object>(std::vector<std::shared_ptr<nocturne::Object>, std::allocator<std::shared_ptr<nocturne::Object> > > const&)::{lambda(std::shared_ptr<nocturne::Object> const&)#2}) (this=this@entry=0x55555a066ed8,
objects=std::vector of length 72, capacity 128 = {...},
aabb_func=..., ptr_func=...)
at /home/bernard.lange/nocturne/nocturne/cpp/include/geometry/bvh.h:163
#9 0x00007fff4a28efe9 in nocturne::geometry::BVH::Reset<nocturne::Object> (objects=std::vector of length 72, capacity 128 = {...},
this=0x55555a066ed8)
at /home/bernard.lange/nocturne/nocturne/cpp/include/geometry/bvh.h:105
#10 nocturne::Scenario::LoadObjects (this=this@entry=0x55555a066d00,
objects_json=...)
at /home/bernard.lange/nocturne/nocturne/cpp/src/scenario.cc:1127
#11 0x00007fff4a2919ad in nocturne::Scenario::LoadScenario (
this=this@entry=0x55555a066d00,
scenario_path="/home/bernard.lange/nocturne_dataset/bernard_nocturne_dataset/val/tfrecord-00506-of-01000_353.json")
at /home/bernard.lange/nocturne/nocturne/cpp/src/scenario.cc:227
#12 0x00007fff4a26a264 in nocturne::Scenario::Scenario (
this=0x55555a066d00,
scenario_path="/home/bernard.lange/nocturne_dataset/bernard_nocturne_dataset/val/tfrecord-00506-of-01000_353.json", config=std::unordered_map with 6 elements = {...})
at /home/bernard.lange/nocturne/nocturne/cpp/include/scenario.h:100
#13 0x00007fff4a2766a0 in std::make_unique<nocturne::Scenario, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::variant<bool, long, float>, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::variant<bool, long, float> > > > const&> () at /usr/include/c++/7/bits/unique_ptr.h:821
#14 nocturne::Simulation::Simulation (config=std::unordered_map with 6 elements = {...},
scenario_path="/home/bernard.lange/nocturne_dataset/bernard_nocturne_dataset/val/tfrecord-00506-of-01000_353.json",
this=0x55555c020f20) at /home/bernard.lange/nocturne/nocturne/cpp/include/simulation.h:32
#15 pybind11::detail::initimpl::construct_or_initialize<nocturne::Simulation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::variant<bool, long, float>, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::variant<bool, long, float> > > > const&, 0> ()
at /home/bernard.lange/nocturne/third_party/pybind11/include/pybind11/detail/init.h:73
from nocturne.
So floating point exceptions are not deterministic, but assertion errors are. I have identified invalid files in the training set:
array(['tfrecord-00008-of-01000_364.json',
'tfrecord-00104-of-01000_303.json',
'tfrecord-00128-of-01000_365.json',
'tfrecord-00131-of-01000_86.json',
'tfrecord-00214-of-01000_146.json',
'tfrecord-00402-of-01000_57.json',
'tfrecord-00506-of-01000_353.json',
'tfrecord-00689-of-01000_184.json',
'tfrecord-00811-of-01000_413.json',
'tfrecord-00074-of-01000_192.json',
'tfrecord-00090-of-01000_237.json',
'tfrecord-00151-of-01000_418.json',
'tfrecord-00179-of-01000_445.json',
'tfrecord-00203-of-01000_466.json',
'tfrecord-00206-of-01000_87.json',
'tfrecord-00241-of-01000_464.json',
'tfrecord-00247-of-01000_214.json',
'tfrecord-00279-of-01000_81.json',
'tfrecord-00298-of-01000_75.json',
'tfrecord-00325-of-01000_483.json',
'tfrecord-00343-of-01000_188.json',
'tfrecord-00376-of-01000_41.json',
'tfrecord-00396-of-01000_203.json',
'tfrecord-00411-of-01000_295.json',
'tfrecord-00431-of-01000_130.json',
'tfrecord-00472-of-01000_85.json',
'tfrecord-00483-of-01000_62.json',
'tfrecord-00487-of-01000_377.json',
'tfrecord-00532-of-01000_444.json',
'tfrecord-00534-of-01000_37.json',
'tfrecord-00564-of-01000_247.json',
'tfrecord-00567-of-01000_34.json',
'tfrecord-00570-of-01000_361.json',
'tfrecord-00580-of-01000_420.json',
'tfrecord-00616-of-01000_211.json',
'tfrecord-00639-of-01000_188.json',
'tfrecord-00653-of-01000_394.json',
'tfrecord-00711-of-01000_490.json',
'tfrecord-00735-of-01000_12.json',
'tfrecord-00738-of-01000_388.json',
'tfrecord-00754-of-01000_415.json',
'tfrecord-00802-of-01000_74.json',
'tfrecord-00805-of-01000_368.json',
'tfrecord-00810-of-01000_6.json',
'tfrecord-00829-of-01000_456.json',
'tfrecord-00846-of-01000_330.json',
'tfrecord-00863-of-01000_432.json',
'tfrecord-00868-of-01000_297.json',
'tfrecord-00869-of-01000_43.json',
'tfrecord-00924-of-01000_471.json',
'tfrecord-00937-of-01000_364.json',
'tfrecord-00962-of-01000_378.json',
'tfrecord-00984-of-01000_128.json'], dtype='<U32')
Hopefully, that's the reason behind floating point exception errors. I'll let you know after I run some more experiments.
UPDATE: There are more failing files. I didn't iterate over time :(
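In case it helps others who hit the same assertions, a minimal sketch of skipping the flagged files before they reach the dataloader. It assumes the training code works from a plain list of file paths; `drop_known_bad` is a hypothetical helper, not something the Nocturne dataloader provides.

```python
from pathlib import Path


def drop_known_bad(paths, bad_names):
    """Return the paths whose file name is not in the list of flagged files."""
    bad = set(bad_names)
    # Compare on the file name only, so the filter works for any data directory.
    return [path for path in paths if Path(path).name not in bad]
```

The `bad_names` argument would be the array of file names listed above, extended as more failing files turn up.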
from nocturne.
Small update: here are the configs I used to find the failing scenes listed above. Depending on the configs, I get more or fewer assertion errors. In particular, I noticed this when changing the view angle. Hope that helps.
import numpy as np

# load dataloader config
dataloader_config = {
    'tmin': 0,
    'tmax': 90,
    'view_dist': 80,
    'view_angle': np.radians(120),
    'dt': 0.1,
    'expert_action_bounds': None,
    'expert_position': True,
    'state_normalization': 100,
    'n_stacked_states': 5,
    'perturbations': False,
}
scenario_config = {
    'start_time': 0,
    'allow_non_vehicles': True,
    'spawn_invalid_objects': True,
    'max_visible_road_points': 500,
    'sample_every_n': 1,
    'road_edge_first': False,
}
# read the values back, falling back to defaults for missing keys
tmin = dataloader_config.get('tmin', 0)
tmax = dataloader_config.get('tmax', 90)
view_dist = dataloader_config.get('view_dist', 80)
view_angle = dataloader_config.get('view_angle', np.radians(120))
dt = dataloader_config.get('dt', 0.1)
expert_action_bounds = dataloader_config.get('expert_action_bounds',
                                             [[-6, 6], [-0.7, 0.7]])
expert_position = dataloader_config.get('expert_position', True)
state_normalization = dataloader_config.get('state_normalization', 100)
n_stacked_states = dataloader_config.get('n_stacked_states', 5)
from nocturne.
Are the errors on the files you listed deterministic? I've constructed the subset of files that you have and looped the dataloader over them but am not seeing an error yet.
Reproduction script for reference
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
"""Imitation learning training script (behavioral cloning)."""
from datetime import datetime
from pathlib import Path
import pickle
import random
import json

import hydra
import numpy as np
import torch
from torch.utils.tensorboard import SummaryWriter
from torch.optim import Adam
from torch.utils.data import DataLoader
from tqdm import tqdm
import wandb

from examples.imitation_learning.model import ImitationAgent
from examples.imitation_learning.waymo_data_loader import WaymoDataset


def set_seed_everywhere(seed):
    """Ensure determinism."""
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)


@hydra.main(config_path="../../cfgs/imitation", config_name="config")
def main(args):
    """Train an IL model."""
    set_seed_everywhere(args.seed)
    expert_bounds = [[-6, 6], [-0.7, 0.7]]
    # load dataloader config
    dataloader_config = {
        'tmin': 0,
        'tmax': 90,
        'view_dist': 80,
        'view_angle': np.radians(120),
        'dt': 0.1,
        'expert_action_bounds': expert_bounds,
        'expert_position': False,
        'state_normalization': 100,
        'n_stacked_states': 5,
        'perturbations': False,
    }
    scenario_config = {
        'start_time': 0,
        'allow_non_vehicles': True,
        'spawn_invalid_objects': True,
        'max_visible_road_points': 500,
        'sample_every_n': 1,
        'road_edge_first': False,
    }
    dataset = WaymoDataset(
        data_path=args.path,
        file_limit=args.num_files,
        dataloader_config=dataloader_config,
        scenario_config=scenario_config,
    )
    data_loader = iter(
        DataLoader(
            dataset,
            batch_size=args.batch_size,
            num_workers=args.n_cpus,
            pin_memory=True,
        ))
    # create exp dir
    time_str = datetime.now().strftime('%Y_%m_%d_%H_%M_%S')
    exp_dir = Path.cwd() / Path('train_logs') / time_str
    exp_dir.mkdir(parents=True, exist_ok=True)
    # train loop
    for epoch in range(args.epochs):
        print(f'\nepoch {epoch+1}/{args.epochs}')
        n_samples = epoch * args.batch_size * (args.samples_per_epoch //
                                               args.batch_size)
        for i in tqdm(range(args.samples_per_epoch // args.batch_size),
                      unit='batch'):
            # get states and expert actions
            states, expert_actions = next(data_loader)


if __name__ == '__main__':
    main()
from nocturne.
I see. Yes, the assertion errors are deterministic, but they only show up when nocturne is compiled with the debug flag on. The floating point exceptions are not deterministic, and I don't have a clear idea where they are coming from. I'll run your script on my machine later and let you know the outcome.
EDIT: Got delayed. I'll run it today.
from nocturne.
Thanks!
from nocturne.