InvokeEndpoint response times out.
Reproduction Steps
{
  "trainingJob": {
    "hyperparameters": {
      "n-hidden": "2",
      "n-epochs": "100",
      "lr": "1e-2"
    },
    "instanceType": "ml.c5.9xlarge",
    "timeoutInSeconds": 10800
  }
}
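For context, the inference Lambda reaches the endpoint roughly as in the sketch below; the endpoint name, content type, and payload key are placeholders, not taken from this report. Because the model worker dies during load (see the log below), invoke_endpoint never gets a response and the Lambda hits its own 120-second limit.

import json
import boto3
from botocore.config import Config

# Keep the SDK read timeout below the Lambda timeout so a dead endpoint
# fails fast instead of burning the whole 120 s (botocore's default read
# timeout is 60 s).
runtime = boto3.client(
    "sagemaker-runtime",
    config=Config(read_timeout=60, retries={"max_attempts": 0}),
)

def lambda_handler(event, context):
    # "fraud-detection-endpoint" and the CSV payload are hypothetical.
    response = runtime.invoke_endpoint(
        EndpointName="fraud-detection-endpoint",
        ContentType="text/csv",
        Body=event["body"],
    )
    return json.loads(response["Body"].read())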
Error Log
In the inference Lambda's CloudWatch logs:
Task timed out after 120.10 seconds
In the SageMaker endpoint's CloudWatch logs (TorchServe):
2021-04-09 04:53:46,902 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: model.mar
2021-04-09 04:53:49,837 [INFO ] main org.pytorch.serve.archive.ModelArchive - eTag 8ff2b3de4bed4fb1bc7fe969652117ff
2021-04-09 04:53:49,847 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model model loaded.
2021-04-09 04:53:49,865 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2021-04-09 04:53:49,930 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8080
2021-04-09 04:53:49,930 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2021-04-09 04:53:49,931 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
Model server started.
2021-04-09 04:53:49,957 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Listening on port: /home/model-server/tmp/.ts.sock.9000
2021-04-09 04:53:49,959 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - [PID]55
2021-04-09 04:53:49,959 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Torch worker started.
2021-04-09 04:53:49,959 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Python runtime: 3.6.13
2021-04-09 04:53:49,963 [INFO ] W-9000-model_1 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2021-04-09 04:53:49,972 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2021-04-09 04:53:50,017 [INFO ] pool-2-thread-1 TS_METRICS - CPUUtilization.Percent:33.3|#Level:Host|#hostname:model.aws.local,timestamp:1617944030
2021-04-09 04:53:50,017 [INFO ] pool-2-thread-1 TS_METRICS - DiskAvailable.Gigabytes:19.622234344482422|#Level:Host|#hostname:model.aws.local,timestamp:1617944030
2021-04-09 04:53:50,017 [INFO ] pool-2-thread-1 TS_METRICS - DiskUsage.Gigabytes:4.731609344482422|#Level:Host|#hostname:model.aws.local,timestamp:1617944030
2021-04-09 04:53:50,017 [INFO ] pool-2-thread-1 TS_METRICS - DiskUtilization.Percent:19.4|#Level:Host|#hostname:model.aws.local,timestamp:1617944030
2021-04-09 04:53:50,018 [INFO ] pool-2-thread-1 TS_METRICS - MemoryAvailable.Megabytes:30089.12109375|#Level:Host|#hostname:model.aws.local,timestamp:1617944030
2021-04-09 04:53:50,018 [INFO ] pool-2-thread-1 TS_METRICS - MemoryUsed.Megabytes:902.6953125|#Level:Host|#hostname:model.aws.local,timestamp:1617944030
2021-04-09 04:53:50,018 [INFO ] pool-2-thread-1 TS_METRICS - MemoryUtilization.Percent:4.1|#Level:Host|#hostname:model.aws.local,timestamp:1617944030
2021-04-09 04:53:51,250 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Setting the default backend to "pytorch". You can change it in the ~/.dgl/config.json file or export the DGLBACKEND environment variable. Valid options are: pytorch, mxnet, tensorflow (all lowercase)
2021-04-09 04:53:51,250 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - ------------------ Loading model -------------------
2021-04-09 04:53:51,250 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Backend worker process died.
2021-04-09 04:53:51,250 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Traceback (most recent call last):
2021-04-09 04:53:51,250 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py", line 176, in <module>
2021-04-09 04:53:51,250 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - worker.run_server()
2021-04-09 04:53:51,250 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py", line 148, in run_server
2021-04-09 04:53:51,250 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - self.handle_connection(cl_socket)
2021-04-09 04:53:51,250 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py", line 112, in handle_connection
2021-04-09 04:53:51,250 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - service, result, code = self.load_model(msg)
2021-04-09 04:53:51,250 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py", line 85, in load_model
2021-04-09 04:53:51,251 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - service = model_loader.load(model_name, model_dir, handler, gpu, batch_size)
2021-04-09 04:53:51,251 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.6/site-packages/ts/model_loader.py", line 117, in load
2021-04-09 04:53:51,251 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - model_service.initialize(service.context)
2021-04-09 04:53:51,251 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/model-server/tmp/models/8ff2b3de4bed4fb1bc7fe969652117ff/handler_service.py", line 51, in initialize
2021-04-09 04:53:51,251 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - super().initialize(context)
2021-04-09 04:53:51,251 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/default_handler_service.py", line 66, in initialize
2021-04-09 04:53:51,251 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - self._service.validate_and_initialize(model_dir=model_dir)
2021-04-09 04:53:51,251 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/transformer.py", line 158, in validate_and_initialize
2021-04-09 04:53:51,251 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - self._model = self._model_fn(model_dir)
2021-04-09 04:53:51,251 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/opt/ml/model/code/fd_sl_deployment_entry_point.py", line 149, in model_fn
2021-04-09 04:53:51,251 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - rgcn_model.load_state_dict(stat_dict)
2021-04-09 04:53:51,251 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1045, in load_state_dict
2021-04-09 04:53:51,251 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - self.__class__.__name__, "\n\t".join(error_msgs)))
2021-04-09 04:53:51,251 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - RuntimeError: Error(s) in loading state_dict for HeteroRGCN:
2021-04-09 04:53:51,251 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     size mismatch for layers.0.weight.DeviceInfo<>target.weight: copying a param with shape torch.Size([2, 390]) from checkpoint, the shape in current model is torch.Size([16, 390]).
2021-04-09 04:53:51,252 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     size mismatch for layers.0.weight.DeviceInfo<>target.bias: copying a param with shape torch.Size([2]) from checkpoint, the shape in current model is torch.Size([16]).
2021-04-09 04:53:51,252 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     size mismatch for layers.0.weight.DeviceType<>target.weight: copying a param with shape torch.Size([2, 390]) from checkpoint, the shape in current model is torch.Size([16, 390]).
2021-04-09 04:53:51,252 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     size mismatch for layers.0.weight.DeviceType<>target.bias: copying a param with shape torch.Size([2]) from checkpoint, the shape in current model is torch.Size([16]).
2021-04-09 04:53:51,252 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     size mismatch for layers.0.weight.P_emaildomain<>target.weight: copying a param with shape torch.Size([2, 390]) from checkpoint, the shape in current model is torch.Size([16, 390]).
2021-04-09 04:53:51,252 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     size mismatch for layers.0.weight.P_emaildomain<>target.bias: copying a param with shape torch.Size([2]) from checkpoint, the shape in current model is torch.Size([16]).
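The failure above can be reproduced in isolation. The sketch below uses a plain nn.Linear as a stand-in for the HeteroRGCN layer, since only the shapes matter: a checkpoint saved with an output dimension of 2 cannot be loaded into a module built with an output dimension of 16.

import torch.nn as nn

saved = nn.Linear(390, 2)      # shapes the checkpoint was trained with (n-hidden = 2)
current = nn.Linear(390, 16)   # shapes the deployment code constructs (hidden_size = 16)

try:
    current.load_state_dict(saved.state_dict())
except RuntimeError as err:
    # "size mismatch for weight: copying a param with shape torch.Size([2, 390])
    #  from checkpoint, the shape in current model is torch.Size([16, 390])."
    print(err)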
Environment
- CDK CLI Version:
- Framework Version:
- Node.js Version:
- OS:
Other
Cause of this bug:
The TorchServe backend worker dies while loading the model ("Backend worker process died." in the log above), so the endpoint never responds and the inference Lambda times out.
The SageMaker endpoint deployment code and the model training code disagree on the hidden-layer size: training ran with n-hidden = 2 (see the hyperparameters above), while the deployment entry point rebuilds HeteroRGCN with hidden_size = 16, so load_state_dict fails with the size mismatches shown in the log (torch.Size([2, 390]) vs. torch.Size([16, 390])). A sketch of one possible fix follows.
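A minimal sketch of one way to keep the two sides consistent, assuming the training script can write a JSON file next to the weights and model_fn can read it back. HeteroRGCN here is a stand-in class, not the repository's implementation, and the file names are illustrative.

import json
import os
import torch
import torch.nn as nn

class HeteroRGCN(nn.Module):
    # Stand-in for the real model; only the hidden_size plumbing matters here.
    def __init__(self, hidden_size):
        super().__init__()
        self.layer = nn.Linear(390, hidden_size)

# Training side: persist the hyperparameters alongside model.pth.
def save_artifacts(model, hparams, model_dir):
    torch.save(model.state_dict(), os.path.join(model_dir, "model.pth"))
    with open(os.path.join(model_dir, "hyperparameters.json"), "w") as f:
        json.dump(hparams, f)  # e.g. {"n-hidden": "2"}

# Serving side: rebuild the model with the saved size instead of a hard-coded one.
def model_fn(model_dir):
    with open(os.path.join(model_dir, "hyperparameters.json")) as f:
        hparams = json.load(f)
    model = HeteroRGCN(hidden_size=int(hparams["n-hidden"]))
    state_dict = torch.load(os.path.join(model_dir, "model.pth"), map_location="cpu")
    model.load_state_dict(state_dict)
    return model.eval()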
This is a 🐛 Bug Report.