@zhanghang1989: Below is the MXNet error I get when I call scheduler.run() twice in the same Python session. Here are the steps to reproduce:
- Check out the tabular branch: https://github.com/awslabs/autogluon/tree/tabular
git checkout tabular
- Install the tabular module by following the steps in tabular/README: https://github.com/awslabs/autogluon/blob/tabular/tabular/README.md
- Verify that your installation works by running the simple example in:
https://github.com/awslabs/autogluon/blob/tabular/autogluon/task/predict_table_column/examples/example_tabular_predictions.py
Note that this example does not do any HPO and does not use ag.schedulers at all.
You can run this example many times in a row inside the same Python session without any segfault.
- Now try running the example in:
https://github.com/awslabs/autogluon/blob/tabular/autogluon/task/predict_table_column/examples/example_advanced_tabular.py
This example should also work (it may produce tons of warnings, but should not produce any MXNet segfault). This example demonstrates doing HPO during task.fit() by leveraging the ag.scheduler and internally calls scheduler.run() one time. The key line of code that does this is: https://github.com/awslabs/autogluon/blob/tabular/autogluon/task/predict_table_column/examples/example_advanced_tabular.py#L30
predictor = task.fit(train_data=train_data, label=label_column, output_directory=savedir, hyperparameter_tune=True, num_trials=10, time_limits=10*60, nn_options=nn_options)
- Now re-run this same example, but instead of calling ag.done() after the line of code above, run that line two times in a row, i.e.:
`predictor = task.fit(train_data=train_data, label=label_column, output_directory=savedir, hyperparameter_tune=True,
num_trials=10, time_limits=10*60, nn_options=nn_options)
predictor = task.fit(train_data=train_data, label=label_column, output_directory=savedir, hyperparameter_tune=True,
num_trials=10, time_limits=10*60, nn_options=nn_options)
`
- This should produce the segfault error below:
0%| | 0/10 [00:00<?, ?it/s]
Segmentation fault: 11
Stack trace:
[bt] (0) 1 libmxnet.so 0x0000000117e062b0 mxnet::Storage::Get() + 4880
[bt] (1) 2 libsystem_platform.dylib 0x00007fff7f0f3b5d _sigtramp + 29
[bt] (2) 3 ??? 0x0000000000000000 0x0 + 0
[bt] (3) 4 libBLAS.dylib 0x00007fff4fac5d44 APL_sgemm + 806
[bt] (4) 5 libBLAS.dylib 0x00007fff4fa504c2 cblas_sgemm + 1592
[bt] (5) 6 libmxnet.so 0x000000011654e8b5 mxnet::op::FullyConnectedComputeExCPU(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocatormxnet::OpReqType > const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&) + 14421
[bt] (6) 7 libmxnet.so 0x000000011654b5f8 mxnet::op::FullyConnectedComputeExCPU(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocatormxnet::OpReqType > const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&) + 1432
[bt] (7) 8 libmxnet.so 0x000000011654b363 mxnet::op::FullyConnectedComputeExCPU(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocatormxnet::OpReqType > const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&) + 771
[bt] (8) 9 libmxnet.so 0x000000011774dca9 mxnet::imperative::PushFComputeEx(std::__1::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocatormxnet::OpReqType > const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::__1::vector<mxnet::engine::Var*, std::__1::allocatormxnet::engine::Var* > const&, std::__1::vector<mxnet::engine::Var*, std::__1::allocatormxnet::engine::Var* > const&, std::__1::vector<mxnet::Resource, std::__1::allocatormxnet::Resource > const&, std::__1::vector<mxnet::NDArray*, std::__1::allocatormxnet::NDArray* > const&, std::__1::vector<mxnet::NDArray*, std::__1::allocatormxnet::NDArray* > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocatormxnet::OpReqType > const&)::'lambda'(mxnet::RunContext)::operator()(mxnet::RunContext) const + 217
Segmentation fault: 11
Stack trace:
[bt] (0) 1 libmxnet.so 0x0000000117e062b0 mxnet::Storage::Get() + 4880
[bt] (1) 2 libsystem_platform.dylib 0x00007fff7f0f3b5d _sigtramp + 29
[bt] (2) 3 ??? 0x0000000000000000 0x0 + 0
[bt] (3) 4 libBLAS.dylib 0x00007fff4fac5d44 APL_sgemm + 806
[bt] (4) 5 libBLAS.dylib 0x00007fff4fa504c2 cblas_sgemm + 1592
[bt] (5) 6 libmxnet.so 0x000000011654e8b5 mxnet::op::FullyConnectedComputeExCPU(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocatormxnet::OpReqType > const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&) + 14421
[bt] (6) 7 libmxnet.so 0x000000011654b5f8 mxnet::op::FullyConnectedComputeExCPU(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocatormxnet::OpReqType > const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&) + 1432
[bt] (7) 8 libmxnet.so 0x000000011654b363 mxnet::op::FullyConnectedComputeExCPU(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocatormxnet::OpReqType > const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&) + 771
[bt] (8) 9 libmxnet.so 0x000000011774dca9 mxnet::imperative::PushFComputeEx(std::__1::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocatormxnet::OpReqType > const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::__1::vector<mxnet::engine::Var*, std::__1::allocatormxnet::engine::Var* > const&, std::__1::vector<mxnet::engine::Var*, std::__1::allocatormxnet::engine::Var* > const&, std::__1::vector<mxnet::Resource, std::__1::allocatormxnet::Resource > const&, std::__1::vector<mxnet::NDArray*, std::__1::allocatormxnet::NDArray* > const&, std::__1::vector<mxnet::NDArray*, std::__1::allocatormxnet::NDArray* > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocatormxnet::OpReqType > const&)::'lambda'(mxnet::RunContext)::operator()(mxnet::RunContext) const + 217
Segmentation fault: 11
Stack trace:
[bt] (0) 1 libmxnet.so 0x0000000117e062b0 mxnet::Storage::Get() + 4880
[bt] (1) 2 libsystem_platform.dylib 0x00007fff7f0f3b5d _sigtramp + 29
[bt] (2) 3 ??? 0x000000010e3eea00 0x0 + 4533971456
[bt] (3) 4 libmxnet.so 0x00000001161c4ad3 std::__1::map<std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator >, mxnet::NDArrayFunctionReg*, std::__1::less<std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator > const, mxnet::NDArrayFunctionReg*> > >::__find_equal_key(std::__1::__tree_node_base<void*>&, std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator > const&) + 867
[bt] (4) 5 libmxnet.so 0x0000000117479f3a void mxnet::op::FillComputeZerosExmshadow::cpu(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocatormxnet::OpReqType > const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&) + 666
[bt] (5) 6 libmxnet.so 0x0000000117685b62 SetNDInputsOutputs(nnvm::Op const, std::__1::vector<mxnet::NDArray*, std::__1::allocatormxnet::NDArray* >, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray* >, int, void const*, int*, int, int, void***) + 3330
[bt] (6) 7 libmxnet.so 0x00000001176853c8 SetNDInputsOutputs(nnvm::Op const*, std::__1::vector<mxnet::NDArray*, std::__1::allocatormxnet::NDArray* >, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray* >, int, void const*, int*, int, int, void***) + 1384
[bt] (7) 8 libmxnet.so 0x00000001176861d0 MXImperativeInvokeEx + 176
[bt] (8) 9 _ctypes.cpython-37m-darwin.so 0x000000010f609367 ffi_call_unix64 + 79
Segmentation fault: 11
Stack trace:
[bt] (0) 1 libmxnet.so 0x0000000117e062b0 mxnet::Storage::Get() + 4880
[bt] (1) 2 libsystem_platform.dylib 0x00007fff7f0f3b5d _sigtramp + 29
[bt] (2) 3 Python 0x000000010e1a6fdd member_set + 52
[bt] (3) 4 libmxnet.so 0x00000001161c4ad3 std::__1::map<std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator >, mxnet::NDArrayFunctionReg*, std::__1::less<std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator > const, mxnet::NDArrayFunctionReg*> > >::__find_equal_key(std::__1::__tree_node_base<void*>&, std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator > const&) + 867
[bt] (4) 5 libmxnet.so 0x0000000117479f3a void mxnet::op::FillComputeZerosExmshadow::cpu(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocatormxnet::OpReqType > const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&) + 666
[bt] (5) 6 libmxnet.so 0x0000000117685b62 SetNDInputsOutputs(nnvm::Op const, std::__1::vector<mxnet::NDArray*, std::__1::allocatormxnet::NDArray* >, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray* >, int, void const*, int*, int, int, void***) + 3330
[bt] (6) 7 libmxnet.so 0x00000001176853c8 SetNDInputsOutputs(nnvm::Op const*, std::__1::vector<mxnet::NDArray*, std::__1::allocatormxnet::NDArray* >, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray* >, int, void const*, int*, int, int, void***) + 1384
[bt] (7) 8 libmxnet.so 0x00000001176861d0 MXImperativeInvokeEx + 176
[bt] (8) 9 _ctypes.cpython-37m-darwin.so 0x000000010f609367 ffi_call_unix64 + 79