aws-samples / amazon-sagemaker-pytorch-detectron2

This repository shows how to train an object detection algorithm with Detectron2 on Amazon SageMaker

License: MIT No Attribution

Shell 2.26% Python 53.69% Jupyter Notebook 44.04%

computer-vision detectron2 sagemaker object-detection sku110k deep-learning machine-learning pytorch

amazon-sagemaker-pytorch-detectron2's Issues

Change Serve Dockerfile to automatically include the inference code

Hi,

I want to change the serve Dockerfile in a way that is similar to the training Dockerfile where it defines an entrypoint on its own.

The reason for this is that I do not want to pass anything in the source directory parameter when building the model so that the training artifact can be used directly.

For Example:

model = PyTorchModel(
    name="d2-model",
    model_data=training_job_artifact,
    role=role,
    sagemaker_session=sm_session,
    entry_point="predict_det2.py",
    # source_dir="container_serving",
    image_uri=serve_image_uri,
    framework_version="1.6.0",
    code_location=f"s3://{bucket}/{prefix_code}",
)

The Dockerfile is shown below (I tried several ENTRYPOINT lines, which I have left in as comments):

# Build an image of Detectron2 with Sagemaker Multi Model Server: https://github.com/awslabs/multi-model-server

# using Sagemaker PyTorch container as base image
# from https://github.com/aws/sagemaker-pytorch-serving-container/

ARG REGION
FROM 763104351884.dkr.ecr.$REGION.amazonaws.com/pytorch-inference:1.5.1-gpu-py36-cu101-ubuntu16.04
LABEL author="[email protected]"

############# Installing latest builds ############
RUN pip install --upgrade torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

# ENV FORCE_CUDA="1"
# Build D2 only for Turing (G4) and Volta (P3) architectures. Use P3 for batch transforms and G4 for inference on endpoints
# ENV TORCH_CUDA_ARCH_LIST="Turing;Volta"

# Install Detectron2
RUN pip install \
   --no-cache-dir pycocotools~=2.0.0 \
   --no-cache-dir https://dl.fbaipublicfiles.com/detectron2/wheels/cu101/torch1.6/detectron2-0.4%2Bcu101-cp36-cp36m-linux_x86_64.whl

# Set a fixed model cache directory. Detectron2 requirement
ENV FVCORE_CACHE="/tmp"

############# SageMaker section ##############

COPY container_serving /opt/ml/code
WORKDIR /opt/ml/code

ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code
ENV SAGEMAKER_PROGRAM predict_sku110k.py

WORKDIR /

# ENTRYPOINT ["bash", "-m", "start_with_right_hostname.sh"] # As this exists in PyTorch Inference Toolkit "artifact" folder so I thought maybe it could work
# ENTRYPOINT ["python", "predict_sku110k.py"]
ENTRYPOINT ["python", "/opt/ml/code/predict_sku110k.py"]

I created a model using the images built from these Dockerfiles. The model is created successfully and shows up in the AWS SageMaker console, but when I start a Batch Transform job it fails because it cannot locate the predict_sku110k.py file.

Any help would be appreciated. I just want this image to create models that can use training artifacts directly. As far as I understand, this does not work because the artifacts do not contain a code directory with the predict_sku110k.py file. When the model is created with a source directory, however, it uploads a new artifact to S3 that includes the code directory, and that is what makes inference work.

If somehow this code directory can be bundled inside the training image so that when the training artifact is created it is already there, then that would work as well I presume.

Thanks a lot,
AliButtar
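One possible workaround, sketched under the assumption described in the issue itself (the serving container finds the entry point in a code directory inside the model archive): repack the training artifact once so that it already contains code/predict_sku110k.py. The helper name `repack_with_code` and the file names in the usage note are hypothetical; this is not the repository's own method.

```python
import tarfile
from pathlib import Path


def repack_with_code(artifact: str, script: str, output: str) -> str:
    """Copy a model.tar.gz and add `script` under code/ in the new archive."""
    script_path = Path(script)
    with tarfile.open(artifact, "r:gz") as src, tarfile.open(output, "w:gz") as dst:
        for member in src.getmembers():
            # Only regular files carry data; directories are copied as bare entries.
            fileobj = src.extractfile(member) if member.isfile() else None
            dst.addfile(member, fileobj)
        # Assumption: the serving container looks for the script under code/.
        dst.add(str(script_path), arcname=f"code/{script_path.name}")
    return output
```

Calling `repack_with_code("model.tar.gz", "container_serving/predict_sku110k.py", "model_with_code.tar.gz")` and uploading the result to S3 would give an archive that could be passed as model_data, if the assumption about the code directory holds for your container version.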

Bucket hardcoded

The notebook cell below has a hardcoded bucket name, but this should be the user's bucket:


channel_to_expected_size = {
    "training": 8215,
    "validation": 588,
    "test": 2934,
}

prefix_data = "detectron2/data"
bucket_rsr = boto3.resource("s3").Bucket("sagemaker-sku110k-dataset")

Access model checkpoint from AWS S3

Hi there!

I am trying to make predictions in an AWS Batch job using model weights (a checkpoint) stored in an S3 bucket, like this:

from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.MODEL.WEIGHTS = 's3://bucket/folder/model_final.pth'
predictor = DefaultPredictor(cfg)
outputs = predictor(side_dish_image)

However, the docker job doesn't seem to have a direct connection with the S3 bucket, and I get the following:

AssertionError: Checkpoint s3://bucket/folder/model_final.pth not found!

Is there a way I can access the checkpoint from S3? I would prefer to avoid creating an endpoint for inference (I haven't tried it yet).

EDIT: I could also copy it into the Docker image, but that is highly inefficient (335 MB).

Best regards,
Rubén.
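The AssertionError suggests the checkpoint loader did not resolve the s3:// path in this setup. A minimal sketch of one workaround is to download the file to local disk when the job starts and point cfg.MODEL.WEIGHTS at the local copy. The helper name `fetch_checkpoint` and its `client` parameter are hypothetical; `download_file` is the boto3 S3 client call, and the job's IAM role is assumed to have read access to the bucket.

```python
import os
from urllib.parse import urlparse


def fetch_checkpoint(s3_uri, local_dir="/tmp", client=None):
    """Download s3://bucket/key into local_dir and return the local path."""
    parsed = urlparse(s3_uri)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"not an S3 object URI: {s3_uri}")
    local_path = os.path.join(local_dir, os.path.basename(key))
    if client is None:
        import boto3  # imported lazily so the helper can be tested without AWS

        client = boto3.client("s3")
    client.download_file(parsed.netloc, key, local_path)
    return local_path


# Usage inside the job (bucket and key taken from the question above):
# cfg.MODEL.WEIGHTS = fetch_checkpoint("s3://bucket/folder/model_final.pth")
```

This still transfers the 335 MB once per job, but it keeps the weights out of the Docker image.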

Question regarding incremental training

Hello, does this repo support incremental training?

Can I provide a model as a starting point for training, versus starting from scratch?

https://docs.aws.amazon.com/sagemaker/latest/dg/incremental-training.html

Regards,
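Whether the repository's training script supports this is not stated here. As a sketch of how it could work: SageMaker mounts each input channel under /opt/ml/input/data/<channel name>, so a previous model.tar.gz could be passed as an extra channel and unpacked before training. The channel name "pretrained" and the helper `find_initial_weights` are hypothetical, and training.py would still need to assign the returned path to cfg.MODEL.WEIGHTS.

```python
import tarfile
from pathlib import Path


def find_initial_weights(channel_dir, suffix=".pth"):
    """Unpack any *.tar.gz in the channel and return the first checkpoint, or None."""
    root = Path(channel_dir)
    for archive in sorted(root.glob("*.tar.gz")):
        with tarfile.open(archive, "r:gz") as tar:
            # Only unpack archives you produced yourself.
            tar.extractall(root)
    matches = sorted(root.rglob(f"*{suffix}"))
    return str(matches[0]) if matches else None


# Hypothetical usage inside the training script:
# weights = find_initial_weights("/opt/ml/input/data/pretrained")
# if weights is not None:
#     cfg.MODEL.WEIGHTS = weights
```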

Docker required

To push to ECR, Docker must be available on the SageMaker instance. It is not installed by default, and without it the error "docker: command not found" is returned. Could installation instructions please be included?

Hyperparameter Limit Constraint

Hi,

I am working with this notebook to use Detectron2's RetinaNet on our custom dataset instead of the SKU110k dataset. Our dataset has 140 labels, and when I run the training job this error comes up:

'hyperParameters' failed to satisfy constraint: Map value must satisfy constraint: [Member must have length less than or equal to 2500, Member must have length greater than or equal to 0, Member must satisfy regular expression pattern: .*]

Can you suggest a workaround for this issue?
TIA
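The error message caps each hyperparameter value at 2500 characters, which a list of 140 labels can exceed once serialized. One possible workaround, sketched here with hypothetical names, is to write oversized values to JSON files, ship those files to the job through an input channel, and pass only the short values as hyperparameters. The training script would then need to read the file instead of the hyperparameter.

```python
import json
from pathlib import Path

MAX_VALUE_LENGTH = 2500  # limit quoted in the error message above


def offload_long_values(hyperparameters, out_dir, limit=MAX_VALUE_LENGTH):
    """Return (values short enough to pass inline, {name: file path} for the rest)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    inline, offloaded = {}, {}
    for name, value in hyperparameters.items():
        encoded = json.dumps(value)
        if len(encoded) <= limit:
            inline[name] = value
        else:
            # Too long for a hyperparameter: store it as a file instead.
            path = out / f"{name}.json"
            path.write_text(encoded)
            offloaded[name] = str(path)
    return inline, offloaded
```

The files listed in `offloaded` can be uploaded to S3 next to the dataset, so that the job sees them under its input data directory.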

deploy trained model as endpoint

I have been trying to adapt your code to deploy the trained model as an endpoint, but I am having trouble sending images to the endpoint to get a prediction.
A detailed explanation of the problem can be found here: https://stackoverflow.com/questions/70033312/invoke-endpoint-error-detectron2-on-aws-sagemaker-valueerror-type-applicati

Any idea of how to solve this problem, or suggestion on how to perform single-image inference rather than batch transform?

The repository with name 'pytorch-training' does not exist

When running the

%%bash
./build_and_push.sh sagemaker-d2-train-sku110k latest Dockerfile.sku110ktraining

cell there is a docker error

repository 663822598777.dkr.ecr.us-east-1.amazonaws.com/pytorch-training not found: name unknown: The repository with name 'pytorch-training' does not exist in the registry with id '663822598777'
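Note that the registry ID in this error (663822598777) differs from the 763104351884 registry used by the Dockerfiles quoted elsewhere on this page, so the base image may be getting resolved against the wrong account. A small sketch for building and checking the base image URI follows; the helper names are hypothetical, and the registry account can differ in some regions, so check the Deep Learning Containers image list for yours.

```python
DLC_ACCOUNT = "763104351884"  # registry used in the Dockerfiles quoted on this page


def dlc_image_uri(region, repository, tag, account=DLC_ACCOUNT):
    """Build the ECR URI of a base image hosted in the given registry account."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def registry_id_of(image_uri):
    """Return the account ID at the front of an ECR image URI."""
    return image_uri.split(".", 1)[0]


uri = dlc_image_uri("us-east-1", "pytorch-training", "1.6.0-gpu-py36-cu101-ubuntu16.04")
# A mismatch here would point to the wrong account being used for the base image.
assert registry_id_of(uri) == DLC_ACCOUNT
```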

Hyperparameter Training

Hello,

I am running hyperparameter optimization on my custom dataset, but the job fails with a CUDA out-of-memory error:

Skip loading parameter 'head.cls_score.bias' to the model due to incompatible shapes: (720,) in the checkpoint but (1278,) in the model! You might want to double check if this is expected.
Traceback (most recent call last):
File "training.py", line 472, in <module>
train(_parse_args())
File "training.py", line 219, in train
args=(args,),
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/detectron2/modeling/backbone/fpn.py", line 142, in forward
File "/opt/conda/lib/python3.6/site-packages/detectron2/engine/launch.py", line 62, in launch
main_func(*args)
File "training.py", line 156, in _train_impl
trainer.train()
File "/opt/conda/lib/python3.6/site-packages/detectron2/engine/defaults.py", line 431, in train
super().train(self.start_iter, self.max_iter)
File "/opt/conda/lib/python3.6/site-packages/detectron2/engine/train_loop.py", line 138, in train
self.run_step()
prev_features = lateral_features + top_down_features
RuntimeError: CUDA out of memory. Tried to allocate 252.00 MiB (GPU 0; 15.78 GiB total capacity; 14.03 GiB already allocated; 190.75 MiB free; 14.46 GiB reserved in total by PyTorch)

Can you elaborate on how this issue should be tackled?
TIA
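The trace shows the 15.78 GiB GPU running out of memory inside the FPN backbone, so the usual levers are a smaller per-step batch and smaller training images. A minimal sketch, assuming a mutable Detectron2 config (SOLVER.IMS_PER_BATCH, INPUT.MIN_SIZE_TRAIN and INPUT.MAX_SIZE_TRAIN are Detectron2 config keys; the helper name and default values are hypothetical):

```python
def reduce_memory_footprint(cfg, batch_divisor=2, max_side=1024):
    """Shrink the batch size and cap the training image size on a Detectron2-style config."""
    # Fewer images per step is usually the most direct way to cut GPU memory.
    cfg.SOLVER.IMS_PER_BATCH = max(1, cfg.SOLVER.IMS_PER_BATCH // batch_divisor)
    # Cap both the longest side and the candidate shortest sides.
    cfg.INPUT.MAX_SIZE_TRAIN = min(cfg.INPUT.MAX_SIZE_TRAIN, max_side)
    cfg.INPUT.MIN_SIZE_TRAIN = tuple(min(s, max_side) for s in cfg.INPUT.MIN_SIZE_TRAIN)
    return cfg
```

For a tuning job, it may also help to restrict the search ranges so that no trial combines the largest batch size with the largest image size.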

Getting Unauthorized access every time I am trying to pull an image with docker

Description:

Every time I try to build the Dockerfile, it fails while pulling the base image with this error:
=> ERROR [internal] load metadata for 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-training:1.6.0-gpu 0.3s
------
> [internal] load metadata for 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-training:1.6.0-gpu-py36-cu101-ubuntu16.04:
------
failed to solve with frontend dockerfile.v0: failed to create LLB definition: unexpected status code [manifests 1.6.0-gpu-py36-cu101-ubuntu16.04]: 401 Unauthorized

DLC image/dockerfile:

ARG REGION="eu-central-1"
#FROM python:3.8
FROM 763104351884.dkr.ecr.$REGION.amazonaws.com/pytorch-training:1.6.0-gpu-py36-cu101-ubuntu16.04

RUN pip install --upgrade torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

RUN pip install \
	--no-cache-dir pycocotools~=2.0.0 \
	--no-cache-dir https://dl.fbaipublicfiles.com/detectron2/wheels/cu101/torch1.6/detectron2-0.4%2Bcu101-cp36-cp36m-linux_x86_64.whl

   
ENV FORCE_CUDA="1"

ENV FVCORE_CACHE="/tmp"

COPY container_training/sku-110k /opt/ml/code
WORKDIR /opt/ml/code

ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code
ENV SAGEMAKER_PROGRAM training.py

WORKDIR /

ENTRYPOINT ["bash", "-m", "start_with_right_hostname.sh"]

Current behavior:
The build fails.
Expected behavior:
The build succeeds.
Additional context:
I should mention that I perform the login before trying to build the Dockerfile, and the login succeeds.
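A 401 at this step often means Docker is logged in to a different registry than the one in the FROM line (here 763104351884 in eu-central-1). The sketch below, with a hypothetical helper name, builds a docker login command for that exact registry from an ECR authorization token. `get_authorization_token` is the boto3 ECR client call; the calling identity is assumed to be allowed to pull from that registry.

```python
import base64


def docker_login_command(ecr_client, registry_id, region):
    """Return (docker login argv, password) for the given ECR registry."""
    response = ecr_client.get_authorization_token()
    token = response["authorizationData"][0]["authorizationToken"]
    # The token is base64("<username>:<password>").
    username, password = base64.b64decode(token).decode("utf-8").split(":", 1)
    registry = f"{registry_id}.dkr.ecr.{region}.amazonaws.com"
    argv = ["docker", "login", "--username", username, "--password-stdin", registry]
    return argv, password


# Hypothetical usage:
# import boto3, subprocess
# argv, password = docker_login_command(
#     boto3.client("ecr", region_name="eu-central-1"), "763104351884", "eu-central-1")
# subprocess.run(argv, input=password.encode(), check=True)
```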

Sagemaker Compatibility with Detectron2

Hi,

First, thanks for the great code. It is really easy to navigate, understand, and change.

I have been trying various experiments to run Detectron2 on SageMaker, but no version other than the one used in this repo runs. I tried PyTorch 1.8 and 1.9 from the Deep Learning Containers, made the corresponding changes to the Dockerfile, and changed the Detectron2 wheel link to try versions 0.5 and 0.6, but the training never starts.

I also changed training.py to use the default Trainer and Evaluator instead of the custom trainer, and to load data from COCO instances instead of your dataset, to make it more general. This training runs successfully on the current Detectron2 and PyTorch versions, but when I upgrade to newer ones, it produces strange errors and the training job fails.

Can you please guide me on what could be the problem here? Is there any code inside training.py that is specific to detectron2 version 0.4 hence it only runs on that? I can provide any additional information that you require to help me debug this.

I also tried running this repo as is, changing nothing except the PyTorch and Detectron2 versions, but it still failed.

Thanks,
AliButtar

Running into an issue with evaluating the model

I'm running your example as a demo and it keeps failing whenever I introduce evaluation to the training.

I get the following error:

TypeError: _eval_predictions() missing 1 required positional argument: 'predictions'

Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/opt/conda/lib/python3.6/site-packages/detectron2/engine/launch.py", line 94, in _distributed_worker
main_func(*args)
File "/opt/ml/code/training.py", line 172, in _train_impl
Trainer.test(cfg, model, evaluator)
File "/opt/conda/lib/python3.6/site-packages/detectron2/engine/defaults.py", line 552, in test
results_i = inference_on_dataset(model, data_loader, evaluator)
File "/opt/conda/lib/python3.6/site-packages/detectron2/evaluation/evaluator.py", line 182, in inference_on_dataset
results = evaluator.evaluate()
File "/opt/conda/lib/python3.6/site-packages/detectron2/evaluation/coco_evaluation.py", line 175, in evaluate
self._eval_predictions(predictions, img_ids=img_ids)

Have you run into this issue?

Training Results

Hello,
I wanted to ask about the plotting of the training loss: are only the starting and ending losses plotted? I trained the model on a custom dataset and got a result that is an almost perfectly straight downward slope.
Also, is there a way to set the number of epochs in the training process?

TIA
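On the epochs question: Detectron2 schedules training by iterations (SOLVER.MAX_ITER) rather than epochs, so a number of epochs has to be converted using the dataset size and the batch size. A minimal sketch with a hypothetical helper name and made-up numbers:

```python
import math


def epochs_to_max_iter(num_images, ims_per_batch, epochs):
    """Number of iterations needed to pass over num_images `epochs` times."""
    if num_images <= 0 or ims_per_batch <= 0 or epochs <= 0:
        raise ValueError("all arguments must be positive")
    return math.ceil(num_images * epochs / ims_per_batch)


# Example: 1000 images, 16 images per batch, 10 epochs.
max_iter = epochs_to_max_iter(1000, 16, 10)  # 625
# cfg.SOLVER.MAX_ITER = max_iter
```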