aws-samples / amazon-sagemaker-pytorch-detectron2
This repository shows how to train an object detection algorithm with Detectron2 on Amazon SageMaker
License: MIT No Attribution
Hi,
I want to change the serving Dockerfile so that, like the training Dockerfile, it defines its own entrypoint.
The reason is that I do not want to pass a source directory when creating the model, so that the training artifact can be used directly.
For example:
model = PyTorchModel(
    name="d2-model",
    model_data=training_job_artifact,
    role=role,
    sagemaker_session=sm_session,
    entry_point="predict_det2.py",
    # source_dir="container_serving",
    image_uri=serve_image_uri,
    framework_version="1.6.0",
    code_location=f"s3://{bucket}/{prefix_code}",
)
The Dockerfile is as follows (I tried several ENTRYPOINT lines, which are noted in the comments):
# Build an image of Detectron2 with Sagemaker Multi Model Server: https://github.com/awslabs/multi-model-server
# using Sagemaker PyTorch container as base image
# from https://github.com/aws/sagemaker-pytorch-serving-container/
ARG REGION
FROM 763104351884.dkr.ecr.$REGION.amazonaws.com/pytorch-inference:1.5.1-gpu-py36-cu101-ubuntu16.04
LABEL author="[email protected]"
############# Installing latest builds ############
RUN pip install --upgrade torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
# ENV FORCE_CUDA="1"
# Build D2 only for Turing (G4) and Volta (P3) architectures. Use P3 for batch transforms and G4 for inference on endpoints
# ENV TORCH_CUDA_ARCH_LIST="Turing;Volta"
# Install Detectron2
RUN pip install \
--no-cache-dir pycocotools~=2.0.0 \
--no-cache-dir https://dl.fbaipublicfiles.com/detectron2/wheels/cu101/torch1.6/detectron2-0.4%2Bcu101-cp36-cp36m-linux_x86_64.whl
# Set a fixed model cache directory. Detectron2 requirement
ENV FVCORE_CACHE="/tmp"
############# SageMaker section ##############
COPY container_serving /opt/ml/code
WORKDIR /opt/ml/code
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code
ENV SAGEMAKER_PROGRAM predict_sku110k.py
WORKDIR /
# ENTRYPOINT ["bash", "-m", "start_with_right_hostname.sh"] # As this exists in PyTorch Inference Toolkit "artifact" folder so I thought maybe it could work
# ENTRYPOINT ["python", "predict_sku110k.py"]
ENTRYPOINT ["python", "/opt/ml/code/predict_sku110k.py"]
I created a model using the images built from these Dockerfiles. The model is created successfully and shows up in the SageMaker console, but when I start a Batch Transform job it fails because it cannot locate the predict_sku110k.py file.
Any help would be appreciated. I just want this image to create models that can use training artifacts directly. As far as I understand, this does not work because the artifacts do not contain a code directory with the predict_sku110k.py file. When the model is created, however, a new artifact that includes the code directory is uploaded to S3, and that is what makes inference work.
If the code directory could instead be bundled into the training image, so that it is already present when the training artifact is created, I presume that would work as well.
Thanks a lot,
AliButtar
The notebook cell below has a hardcoded bucket, but this should be the user's bucket:
channel_to_expected_size = {
    "training": 8215,
    "validation": 588,
    "test": 2934,
}
prefix_data = "detectron2/data"
bucket_rsr = boto3.resource("s3").Bucket("sagemaker-sku110k-dataset")
Hi there!
I am trying to make predictions in an AWS Batch job using model weights (a checkpoint) stored in an S3 bucket, like this:
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor
cfg = get_cfg()
cfg.MODEL.WEIGHTS = 's3://bucket/folder/model_final.pth'
predictor = DefaultPredictor(cfg)
outputs = predictor(side_dish_image)
However, the Docker job does not seem to have direct access to the S3 bucket, and I get the following error:
AssertionError: Checkpoint s3://bucket/folder/model_final.pth not found!
Is there a way I can access the checkpoint from S3? I would prefer to avoid creating an endpoint for inference (I haven't tried it yet).
EDIT: I could also copy it into the Docker image, but that is highly inefficient (335 MB).
Best regards,
Rubén.
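The assertion above suggests that the checkpoint loader does not resolve s3:// paths on its own. A minimal sketch of one workaround, assuming boto3 is installed in the container and the job's role is allowed to read the object: download the checkpoint once at start-up and point cfg.MODEL.WEIGHTS at the local copy. The helper names split_s3_uri and fetch_checkpoint are hypothetical and are not part of Detectron2 or this repository.

```python
import os
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"Not an s3://bucket/key URI: {uri}")
    return parsed.netloc, key


def fetch_checkpoint(uri, local_dir="/tmp"):
    """Download the checkpoint to local_dir (skipped if already there)."""
    import boto3  # imported lazily so the URI parsing works without boto3

    bucket, key = split_s3_uri(uri)
    local_path = os.path.join(local_dir, os.path.basename(key))
    if not os.path.exists(local_path):
        boto3.client("s3").download_file(bucket, key, local_path)
    return local_path


# Usage inside the job (not executed here):
# cfg.MODEL.WEIGHTS = fetch_checkpoint("s3://bucket/folder/model_final.pth")
```

The download happens at run time, so the 335 MB file does not have to be baked into the image.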
Hello, does this repo support incremental training?
Can I provide a model as a starting point for training, versus starting from scratch?
https://docs.aws.amazon.com/sagemaker/latest/dg/incremental-training.html
Regards,
M
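Whether the repository's training script supports this is not stated here. As a sketch of how a starting checkpoint could be picked up, assuming the previous model is supplied as an extra input channel named "model" (a hypothetical name) and that it contains an unpacked .pth file: SageMaker exposes each input channel to the training container through an SM_CHANNEL_<NAME> environment variable, so the script can look there for initial weights.

```python
import os
from pathlib import Path


def find_initial_weights(channel="model", suffix=".pth"):
    """Return the first checkpoint found in the given input channel, or None.

    The channel name "model" is an assumption for this sketch.
    """
    channel_dir = os.environ.get(f"SM_CHANNEL_{channel.upper()}")
    if not channel_dir:
        return None  # channel not provided: train from the default weights
    candidates = sorted(Path(channel_dir).rglob(f"*{suffix}"))
    return str(candidates[0]) if candidates else None


# Usage in a training script (not executed here):
# weights = find_initial_weights()
# if weights is not None:
#     cfg.MODEL.WEIGHTS = weights
```

If the channel holds a model.tar.gz from an earlier training job rather than a bare .pth file, it would have to be extracted first.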
To push to ECR, Docker must be available on the SageMaker instance. It is not installed by default, so the error docker: command not found is returned. Could installation instructions be included?
Hi,
I am working on this notebook to use Detectron2's RetinaNet on our custom dataset instead of the SKU110k dataset. Our dataset has 140 labels, and when running the training job this error comes up:
'hyperParameters' failed to satisfy constraint: Map value must satisfy constraint: [Member must have length less than or equal to 2500, Member must have length greater than or equal to 0, Member must satisfy regular expression pattern: .*]
Can you suggest a workaround for this issue?
TIA
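The error message quotes a limit of 2500 characters per hyperparameter value. Assuming the 140 labels are passed as a single hyperparameter (an assumption, since the failing call is not shown), one possible workaround is to write the labels to a JSON file that travels with the data and to pass only the file name as a hyperparameter. The sketch below checks which values would exceed the limit and writes the class list to a file. Both helper names are hypothetical, and the training script would need a matching change to read the file.

```python
import json

MAX_HP_LEN = 2500  # limit quoted in the error message


def oversized_hyperparameters(hyperparameters, limit=MAX_HP_LEN):
    """Return {name: serialized length} for values longer than the limit."""
    lengths = {k: len(json.dumps(v)) for k, v in hyperparameters.items()}
    return {k: n for k, n in lengths.items() if n > limit}


def write_classes_file(classes, path):
    """Store the class labels as JSON so only the file name is passed on."""
    with open(path, "w") as f:
        json.dump(list(classes), f)
    return path


def read_classes_file(path):
    """Counterpart for the training script: load the labels back."""
    with open(path) as f:
        return json.load(f)
```

Calling oversized_hyperparameters on the dictionary before launching the job shows which entries need to be moved into a file.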
I have been trying to adapt your code to deploy the trained model as an endpoint, but I am having trouble sending images to the endpoint to get a prediction.
A detailed explanation of the problem can be found here: https://stackoverflow.com/questions/70033312/invoke-endpoint-error-detectron2-on-aws-sagemaker-valueerror-type-applicati
Any idea how to solve this problem, or any suggestion on how to perform single-image inference rather than batch transform?
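The exact error is only visible in the linked question, so the following is a sketch under assumptions rather than a fix. The SageMaker PyTorch serving container lets the entry-point script define an input_fn(request_body, request_content_type) handler, and a request whose content type the handler does not accept typically ends in a ValueError. The list of accepted content types below is an assumption, and decode_image is a hypothetical helper that needs Pillow and NumPy in the serving image.

```python
import io

# Content types accepted by this sketch (an assumed list, adjust as needed).
SUPPORTED_IMAGE_TYPES = ("application/x-image", "image/jpeg", "image/png")


def input_fn(request_body, request_content_type):
    """Accept raw image bytes and hand them on as a readable stream."""
    if request_content_type not in SUPPORTED_IMAGE_TYPES:
        raise ValueError(f"Unsupported content type: {request_content_type}")
    return io.BytesIO(request_body)


def decode_image(stream):
    """Decode the stream into an H x W x 3 BGR array for the predictor."""
    import numpy as np  # imported lazily: only needed in the container
    from PIL import Image

    rgb = np.asarray(Image.open(stream).convert("RGB"))
    return np.ascontiguousarray(rgb[:, :, ::-1])  # RGB -> BGR
```

On the client side, the request would then carry the image bytes as the body with one of the listed content types.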
Running the following cell:
%%bash
./build_and_push.sh sagemaker-d2-train-sku110k latest Dockerfile.sku110ktraining
produces a Docker error:
repository 663822598777.dkr.ecr.us-east-1.amazonaws.com/pytorch-training not found: name unknown: The repository with name 'pytorch-training' does not exist in the registry with id '663822598777'
Hello,
I am running hyperparameter optimization on my custom dataset, but the job is failing with a CUDA memory error.
Can you elaborate on how this issue should be tackled?
TIA
Description:
Every time I try to build the Dockerfile, it fails while pulling the base image with the following error:
=> ERROR [internal] load metadata for 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-training:1.6.0-gpu 0.3s
------
 > [internal] load metadata for 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-training:1.6.0-gpu-py36-cu101-ubuntu16.04:
------
failed to solve with frontend dockerfile.v0: failed to create LLB definition: unexpected status code [manifests 1.6.0-gpu-py36-cu101-ubuntu16.04]: 401 Unauthorized
DLC image/dockerfile:
ARG REGION="eu-central-1"
#FROM python:3.8
FROM 763104351884.dkr.ecr.$REGION.amazonaws.com/pytorch-training:1.6.0-gpu-py36-cu101-ubuntu16.04
RUN pip install --upgrade torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
RUN pip install \
--no-cache-dir pycocotools~=2.0.0 \
--no-cache-dir https://dl.fbaipublicfiles.com/detectron2/wheels/cu101/torch1.6/detectron2-0.4%2Bcu101-cp36-cp36m-linux_x86_64.whl
ENV FORCE_CUDA="1"
ENV FVCORE_CACHE="/tmp"
COPY container_training/sku-110k /opt/ml/code
WORKDIR /opt/ml/code
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code
ENV SAGEMAKER_PROGRAM training.py
WORKDIR /
ENTRYPOINT ["bash", "-m", "start_with_right_hostname.sh"]
Current behavior:
The build fails.
Expected behavior:
The build succeeds.
Additional context:
I have to mention that I perform login before trying to build the docker file, and the build succeeds.
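The report says a login is performed before the build, but not against which registry. One thing worth checking, offered as a possibility rather than a diagnosis, is that the Docker login targets the same registry as the FROM line (account 763104351884 in the chosen region) and not only the user's own account. A minimal sketch, assuming the AWS CLI and Docker are on the PATH. The function names are hypothetical.

```python
import subprocess

DLC_ACCOUNT = "763104351884"  # account ID taken from the FROM line above


def dlc_registry(region, account=DLC_ACCOUNT):
    """Build the ECR registry host name used in the Dockerfile."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com"


def ecr_login(region, account=DLC_ACCOUNT):
    """Log Docker in to the registry that hosts the base image."""
    registry = dlc_registry(region, account)
    password = subprocess.run(
        ["aws", "ecr", "get-login-password", "--region", region],
        check=True, capture_output=True, text=True,
    ).stdout
    subprocess.run(
        ["docker", "login", "--username", "AWS", "--password-stdin", registry],
        input=password, check=True, text=True,
    )
    return registry


# Usage before the build (not executed here):
# ecr_login("eu-central-1")
```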
Hi,
First, thanks for the great code. It is really easy to navigate, understand, and change.
The problem is that I have been running various experiments with Detectron2 on SageMaker, but no version other than the one currently used runs. I tried torch 1.8 and 1.9 from the Deep Learning Containers, made the appropriate changes to the Dockerfile, and changed the Detectron2 wheel link to try versions 0.5 and 0.6, but the training never starts.
I also changed the training.py file so that it no longer uses the custom trainer and your data. Instead, it uses the default Trainer and Evaluator and loads data from COCO instances, which makes it more general. This training runs successfully with the current Detectron2 and PyTorch versions, but when I try to upgrade to newer ones it produces strange errors and the training job fails.
Can you please guide me on what the problem could be? Is there any code in training.py that is specific to Detectron2 version 0.4, so that it only runs on that version? I can provide any additional information you need to help debug this.
I also tried running this repo as is, changing nothing except the PyTorch and Detectron2 versions, but it still failed.
Thanks,
AliButtar
I'm running your example as a demo and it keeps failing whenever I introduce evaluation to the training.
I get the following error:
TypeError: _eval_predictions() missing 1 required positional argument: 'predictions'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/opt/conda/lib/python3.6/site-packages/detectron2/engine/launch.py", line 94, in _distributed_worker
    main_func(*args)
  File "/opt/ml/code/training.py", line 172, in _train_impl
    Trainer.test(cfg, model, evaluator)
  File "/opt/conda/lib/python3.6/site-packages/detectron2/engine/defaults.py", line 552, in test
    results_i = inference_on_dataset(model, data_loader, evaluator)
  File "/opt/conda/lib/python3.6/site-packages/detectron2/evaluation/evaluator.py", line 182, in inference_on_dataset
    results = evaluator.evaluate()
  File "/opt/conda/lib/python3.6/site-packages/detectron2/evaluation/coco_evaluation.py", line 175, in evaluate
    self._eval_predictions(predictions, img_ids=img_ids)
Have you run into this issue?