
amazon-sagemaker-bert-pytorch's Introduction

NOTICE: The new Hugging Face Deep Learning Container (DLC) is available in Amazon SageMaker (see Use Hugging Face with Amazon SageMaker). For customers training BERT models, the recommended pattern is to use the Hugging Face DLC, as shown in Finetuning Hugging Face DistilBERT with Amazon Reviews Polarity dataset.

This repo is no longer actively maintained.

Fine tune a PyTorch BERT model and deploy it with Elastic Inference on Amazon SageMaker

Background and Motivation

Text classification is a technique for putting text into different categories and has a wide range of applications: email providers use text classification to detect spam emails, marketing agencies use it for sentiment analysis of customer reviews, and moderators of discussion forums use it to detect inappropriate comments.

In the past, data scientists used methods such as tf-idf, word2vec, or bag-of-words (BOW) to generate features for training classification models. While these techniques have been very successful in many NLP tasks, they don't always capture the meanings of words accurately when they appear in different contexts. Recently, we have seen increasing interest in using Bidirectional Encoder Representations from Transformers (BERT) to achieve better results in text classification tasks, due to its ability to more accurately encode the meaning of words in different contexts.

Amazon SageMaker is a fully managed service that provides developers and data scientists with the ability to build, train, and deploy machine learning (ML) models quickly. Amazon SageMaker removes the heavy lifting from each step of the machine learning process to make it easier to develop high-quality models. The SageMaker Python SDK provides open source APIs and containers that make it easy to train and deploy models in Amazon SageMaker with several different machine learning and deep learning frameworks. We use an Amazon SageMaker Notebook Instance for running the code. For information on how to use Amazon SageMaker Notebook Instances, see the AWS documentation.

Our customers often ask for quick fine-tuning and easy deployment of their NLP models. Furthermore, customers prefer low inference latency and low model inference cost. Amazon Elastic Inference enables attaching GPU-powered inference acceleration to endpoints, reducing the cost of deep learning inference without sacrificing performance.
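As a sketch of what attaching an accelerator looks like at deploy time (the model_data placeholder, instance type, and accelerator size below are illustrative assumptions, not necessarily the notebook's exact choices):

from sagemaker.pytorch.model import PyTorchModel

# Sketch: model_data and role are placeholders for your own values.
pytorch_model = PyTorchModel(model_data="s3://<bucket>/<training-job>/model.tar.gz",
                             role=role,
                             framework_version="1.3.1",
                             py_version="py3",
                             source_dir="code",
                             entry_point="train_deploy.py")

# accelerator_type attaches an Elastic Inference accelerator to the endpoint.
predictor = pytorch_model.deploy(initial_instance_count=1,
                                 instance_type="ml.m5.large",
                                 accelerator_type="ml.eia2.medium")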

The notebook in this repository demonstrates how to use Amazon SageMaker to fine tune a PyTorch BERT model and deploy it with Elastic Inference. We walk through our dataset, the training process, and finally model deployment. This work is inspired by a post by Chris McCormick and Nick Ryan.

What is BERT?

First published in November 2018, BERT is a revolutionary model. First, one or more words in each sentence are intentionally masked. BERT takes in these masked sentences as input and trains itself to predict the masked words. In addition, BERT uses a "next sentence prediction" task that pre-trains text-pair representations. BERT is a substantial breakthrough and has helped researchers and data engineers across industry achieve state-of-the-art results in many Natural Language Processing (NLP) tasks. BERT offers a representation of each word conditioned on its context (the rest of the sentence). For more information about BERT, please refer to [1].
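To make the masked-word objective concrete, the short sketch below (not part of this repository's notebook) uses the Hugging Face transformers fill-mask pipeline to have a pretrained BERT predict a masked token:

from transformers import pipeline

# Load a pretrained BERT and ask it to fill in the masked token.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], prediction["score"])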

BERT fine tuning

One of the biggest challenges data scientists face for NLP projects is lack of training data; they often have only a few thousand pieces of human-labeled text data for their model training. However, modern deep learning NLP tasks require a large amount of labeled data. One way to solve this problem is to use transfer learning.

Transfer learning is a machine learning method where a pre-trained model, such as a pre-trained ResNet model for image classification, is reused as the starting point for a different but related problem. By reusing parameters from pre-trained models, one can save significant amounts of training time and cost.

BERT was trained on BookCorpus and English Wikipedia data, which contain 800 million words and 2,500 million words, respectively [2]. Training BERT from scratch would be prohibitively expensive. By taking advantage of transfer learning, one can quickly fine tune BERT for another use case with a relatively small amount of training data to achieve state-of-the-art results for common NLP tasks, such as text classification and question answering.
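As a minimal sketch of that starting point (assuming the transformers library and the standard bert-base-uncased checkpoint), loading pretrained weights with a fresh classification head looks like this:

from transformers import BertForSequenceClassification, BertTokenizer

# The pretrained encoder weights are reused; only the classification head
# on top is randomly initialized and learned during fine tuning.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)  # binary classification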

References

[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/pdf/1810.04805.pdf

[2] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27.

License

This library is licensed under the MIT-0 License. See the LICENSE file.

amazon-sagemaker-bert-pytorch's People

Contributors

amazon-auto, billdoors, hsl89, laurenyu

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

amazon-sagemaker-bert-pytorch's Issues

ValueError: framework_version or py_version was None, yet image_uri was also None. Either specify both framework_version and py_version, or specify image_uri.

The code below generates an error:

from sagemaker.pytorch.model import PyTorchModel 

pytorch_model = PyTorchModel(model_data="s3://sagemaker-us-east-2-503254810580/pytorch-training-2020-11-15-14-28-24-539/model.tar.gz",
                             role=role,
                             framework_version="1.3.1",
                             source_dir="code",
                             entry_point="train_deploy.py")

predictor = pytorch_model.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")

Error received:

ValueError: framework_version or py_version was None, yet image_uri was also None. Either specify both framework_version and py_version, or specify image_uri.

Resolution: Adding py_version resolves the error:

from sagemaker.pytorch.model import PyTorchModel 

pytorch_model = PyTorchModel(model_data="s3://sagemaker-us-east-2-503254810580/pytorch-training-2020-11-15-14-28-24-539/model.tar.gz",
                             role=role,
                             framework_version="1.3.1",
                             py_version="py3",  # required alongside framework_version in SageMaker SDK v2
                             source_dir="code",
                             entry_point="train_deploy.py")

predictor = pytorch_model.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")

For details, see https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html

Error in JSONSerializers due to new SageMaker SDK

The new SageMaker SDK raises an error with the old v1-style serializer imports:

from sagemaker.predictor import json_deserializer, json_serializer

predictor.serializer = json_serializer
predictor.deserializer = json_deserializer

Resolution:

from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

predictor.serializer = JSONSerializer()
predictor.deserializer = JSONDeserializer()
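With the v2 serializers in place, invoking the endpoint might look like the following; the plain-string payload is an assumption about what train_deploy.py's input_fn expects:

# Hypothetical invocation; payload format depends on the script's input_fn.
result = predictor.predict("This movie was great!")
print(result)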

Error when using a larger instance

Hi,
I want to use multiple GPUs in one instance; for example, ml.p3.8xlarge has 4 GPUs. However, when I change train_instance_type="ml.p3.2xlarge" to "ml.p3.8xlarge", I get an error like the one below:

[2020-09-02 15:22:37.902 algo-1:76 INFO utils.py:25] The end of training job file will not be written for jobs running under SageMaker.
2020-09-02 15:22:38,696 sagemaker-containers ERROR    ExecuteUserScriptError:
Command "/opt/conda/bin/python train_deploy.py --backend gloo --epochs 1 --num_labels 2"
Traceback (most recent call last):
  File "train_deploy.py", line 327, in <module>
    train(parser.parse_args())
  File "train_deploy.py", line 187, in train
    outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 552, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 444, in forward
    outputs = self.parallel_apply(self._module_copies[:len(inputs)], inputs, kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 469, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/opt/conda/lib/python3.6/site-packages/torch/_utils.py", line 385, in reraise
    raise self.exc_type(msg)
KeyError: Caught KeyError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 552, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/transformers/modeling_bert.py", line 1032, in forward
    inputs_embeds=inputs_embeds)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 552, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/transformers/modeling_bert.py", line 735, in forward
    embedding_output = self.embeddings(input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 552, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/transformers/modeling_bert.py", line 186, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 554, in __call__
    hook_result = hook(self, input, result)
  File "/opt/conda/lib/python3.6/site-packages/smdebug/pytorch/hook.py", line 153, in forward_hook
    module_name = self.module_maps[module]
KeyError: Embedding(30522, 768, padding_idx=0)
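The last frames show the failure originating in smdebug's forward hook, which does not recognize the module replicas created for multi-GPU data parallelism. One workaround, an assumption based on that reading of the traceback rather than a fix confirmed in this repository, is to disable the SageMaker debugger hook when constructing the estimator:

from sagemaker.pytorch import PyTorch

# Sketch: role is assumed defined; debugger_hook_config=False is the assumed
# workaround, preventing smdebug from registering the failing forward hook.
estimator = PyTorch(entry_point="train_deploy.py",
                    source_dir="code",
                    role=role,
                    framework_version="1.3.1",
                    py_version="py3",
                    instance_count=1,
                    instance_type="ml.p3.8xlarge",
                    debugger_hook_config=False,
                    hyperparameters={"backend": "gloo", "epochs": 1, "num_labels": 2})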

I tested the scripts and got an error

I did a test run as is, but got:

UnexpectedStatusException: Error for Training job pytorch-training-2020-10-27-16-28-37-955: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/opt/conda/bin/python train_deploy.py --backend gloo --epochs 1 --num_labels 2"

Any help is appreciated.

This is the stdout from the beginning of running train_deploy.py:

'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
2020-10-27 16:28:38 Starting - Starting the training job...
2020-10-27 16:28:42 Starting - Launching requested ML instances.........
2020-10-27 16:30:13 Starting - Preparing the instances for training............
2020-10-27 16:32:19 Downloading - Downloading input data...
2020-10-27 16:33:09 Training - Downloading the training image..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell

I am not sure whether the training failure is related to the above warnings.
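The AlgorithmError message shown by the SDK truncates the real traceback; the full stack trace from train_deploy.py lands in the job's CloudWatch logs. A sketch of pulling them programmatically with boto3 (the job name is the one from this report):

import boto3

logs = boto3.client("logs")

# SageMaker training jobs write to this log group, one stream per host.
streams = logs.describe_log_streams(
    logGroupName="/aws/sagemaker/TrainingJobs",
    logStreamNamePrefix="pytorch-training-2020-10-27-16-28-37-955")

for stream in streams["logStreams"]:
    events = logs.get_log_events(logGroupName="/aws/sagemaker/TrainingJobs",
                                 logStreamName=stream["logStreamName"])
    for event in events["events"]:
        print(event["message"])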

ValueError: got_ver is None on installing transformers

I am using the same approach for another transformers model, LayoutLMv2. However, the training job fails with the error below. It seems that the transformers dependencies are not loaded properly.

UnexpectedStatusException: Error for Training job pytorch-training-2022-05-29-18-38-24-543: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/opt/conda/bin/python train_deploy.py --epochs 6 --num_labels 7"
Traceback (most recent call last):
  File "train_deploy.py", line 12, in <module>
    from transformers import AdamW, LayoutLMv2Processor, LayoutLMv2ForTokenClassification
  File "/opt/conda/lib/python3.6/site-packages/transformers/__init__.py", line 30, in <module>
    from . import dependency_versions_check
  File "/opt/conda/lib/python3.6/site-packages/transformers/dependency_versions_check.py", line 41, in <module>
    require_version_core(deps[pkg])
  File "/opt/conda/lib/python3.6/site-packages/transformers/utils/versions.py", line 120, in require_version_core
    return require_version(requirement, hint)
  File "/opt/conda/lib/python3.6/site-packages/transformers/utils/versions.py", line 114, in require_version
    _compare_versions(op, got_ver, want_ver, requirement, pkg, hint)
  File "/opt/conda/lib/python3.6/site-packages/transformers/utils/versions.py", line 45, in _compare_versions

I tried running the script directly in a venv created from the requirements file provided for training, and it works fine.
How do I debug this issue?
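The got_ver is None failure comes from transformers' dependency version check, which reads installed package metadata via importlib_metadata and gets None when a required package's metadata is missing or broken in the container. One way to narrow it down, a sketch assuming the failing dependency is unknown, is to log what the container actually sees at the top of train_deploy.py:

import importlib_metadata

# Hypothetical probe: print the version metadata transformers will check,
# so any missing package shows up in the CloudWatch training logs.
# The package list below is an assumption, not transformers' exact list.
for pkg in ("numpy", "packaging", "filelock", "requests", "regex",
            "tqdm", "tokenizers"):
    try:
        print(pkg, importlib_metadata.version(pkg))
    except importlib_metadata.PackageNotFoundError:
        print(pkg, "NOT FOUND")  # missing metadata here yields got_ver is None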
