Notebooks using the Hugging Face libraries 🤗
License: Apache License 2.0
Notebook: 01_getting_started_pytorch/sagemaker-notebook.ipynb
Error:
2021-04-21T16:46:12.576-07:00 | /opt/conda/bin/python3.6 train.py --epochs 1 --model_name distilbert-base-uncased --train_batch_size 32
2021-04-21T16:46:17.578-07:00 | Traceback (most recent call last):
  File "train.py", line 42, in <module>
    train_dataset = load_from_disk(args.training_dir)
  File "/opt/conda/lib/python3.6/site-packages/datasets/load.py", line 781, in load_from_disk
    return Dataset.load_from_disk(dataset_path, fs)
  File "/opt/conda/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 684, in load_from_disk
    state = {k: state[k] for k in dataset.__dict__.keys()} # in case we add new fields
  File "/opt/conda/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 684, in <dictcomp>
    state = {k: state[k] for k in dataset.__dict__.keys()} # in case we add new fields
2021-04-21T16:46:17.578-07:00 | KeyError: '_data'
The failing line in train.py is:
train_dataset = load_from_disk(args.training_dir)
I tried changing the notebook to pin only "transformers==4.4.2", which matches SageMaker's Hugging Face Docker image (also based on 4.4.2). Still, it seems that state.json cannot be loaded from the file system (the Arrow file).
Hi,
I encountered a SIGSEGV exception while trying to run a copy of the Simple NLP Example on Colab.
The weird thing is that I didn't modify the example notebook (apart from some minor changes).
The only major change was that I took a PyTorch/XLA sample notebook in which I had previously run my examples and copied the Simple NLP Example code cells into it.
It's really weird, because running the original notebook works fine, but this copy triggers an error in the notebook launcher...
In the question answering notebook, the cell after max_answer_length = 30 contains this code:
.......
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
        # to part of the input_ids that are not in the context.
        if (
            start_index >= len(offset_mapping)
            or end_index >= len(offset_mapping)
            or offset_mapping[start_index] is None
            or offset_mapping[end_index] is None
        ):
            continue
........
Here start_index can never exceed the sequence length, because the argsort is taken over an array of 384 logits, i.e. the max sequence length. So how do the checks start_index >= len(offset_mapping) and end_index >= len(offset_mapping) make sense here?
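A minimal sketch of when the bounds check could matter, written in plain Python rather than NumPy. It assumes a situation the notebook may or may not hit: the logits array is longer than offset_mapping for a feature. The function names and all values below are made up for illustration.

```python
def n_best_indexes(logits, n_best_size):
    """Indices of the n_best_size largest logits, best first.

    Mirrors np.argsort(logits)[-1 : -n_best_size - 1 : -1].tolist().
    """
    order = sorted(range(len(logits)), key=lambda i: logits[i])  # ascending, like argsort
    return order[-1 : -n_best_size - 1 : -1]


def valid_spans(start_logits, end_logits, offset_mapping, n_best_size=2):
    """Collect (start_char, end_char) spans, skipping out-of-scope index pairs."""
    spans = []
    for start_index in n_best_indexes(start_logits, n_best_size):
        for end_index in n_best_indexes(end_logits, n_best_size):
            if (
                start_index >= len(offset_mapping)
                or end_index >= len(offset_mapping)
                or offset_mapping[start_index] is None
                or offset_mapping[end_index] is None
            ):
                continue  # without the length checks, this pair would raise IndexError
            if end_index < start_index:
                continue
            spans.append((offset_mapping[start_index][0], offset_mapping[end_index][1]))
    return spans


# Hypothetical feature: 6 logits, but only 4 offsets (first one is a question token).
start_logits = [0.1, 0.2, 3.0, 0.3, 0.0, 5.0]
end_logits = [0.0, 0.1, 0.2, 4.0, -1.0, 6.0]
offset_mapping = [None, (0, 4), (5, 9), (10, 15)]

best_starts = n_best_indexes(start_logits, 2)
spans = valid_spans(start_logits, end_logits, offset_mapping)
print(best_starts, spans)
```

If the logits and offset_mapping always have the same length, the two length checks are redundant but harmless; they only become load-bearing when the lengths can differ.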
The hyperparameters sent by the client have an underscore in them (e.g. train_batch_size), whereas those received by the argparser have a hyphen (e.g. train-batch-size). Therefore, values do not get propagated through the train.py file.
I have tested the solution on these files
notebooks/sagemaker/01_getting_started_pytorch/sagemaker-notebook.ipynb
notebooks/sagemaker/01_getting_started_pytorch/scripts/train.py
but I suspect we'll have to update train.py in the following folders as well: 05_spot_instances, 06_sagemaker_metrics.
In the train.py file, swap these lines:
parser.add_argument("--train-batch-size", type=int, default=32)
parser.add_argument("--eval-batch-size", type=int, default=64)
parser.add_argument("--output-data-dir", type=str, default=os.environ["SM_OUTPUT_DATA_DIR"])
parser.add_argument("--model-dir", type=str, default=os.environ["SM_MODEL_DIR"])
with these
parser.add_argument("--train_batch_size", type=int, default=32)
parser.add_argument("--eval_batch_size", type=int, default=64)
parser.add_argument("--output_data_dir", type=str, default=os.environ["SM_OUTPUT_DATA_DIR"])
parser.add_argument("--model_dir", type=str, default=os.environ["SM_MODEL_DIR"])
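The mismatch can be reproduced with argparse alone. This is a small sketch, assuming the training script parses its arguments with parse_known_args (which collects unrecognised flags instead of exiting); build_parser is a hypothetical helper, not code from the repository.

```python
import argparse


def build_parser(flag_style):
    """Build a parser that declares the batch-size flag with hyphens or underscores."""
    parser = argparse.ArgumentParser()
    name = "--train-batch-size" if flag_style == "hyphen" else "--train_batch_size"
    # Either spelling is stored under the attribute name train_batch_size.
    parser.add_argument(name, type=int, default=32)
    return parser


# What the client sends: hyperparameter names with underscores.
client_args = ["--train_batch_size", "64"]

hyphen_ns, hyphen_unknown = build_parser("hyphen").parse_known_args(client_args)
underscore_ns, underscore_unknown = build_parser("underscore").parse_known_args(client_args)

# The hyphenated parser does not recognise the flag and keeps its default.
print(hyphen_ns.train_batch_size, hyphen_unknown)
# The underscored parser picks the value up.
print(underscore_ns.train_batch_size, underscore_unknown)
```

Because both spellings map to the same attribute name, the rest of the script can keep reading args.train_batch_size after the flags are renamed.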
Hello,
I successfully ran Jupyter notebook using Sagemaker Huggingface estimator hosted here. After training completed (sequence classification on imdb dataset), I was able to see the following artifacts in s3:
$ aws s3 ls s3://sagemaker-us-east-1-135890****/**** --recursive
2021-05-28 00:18:46 0 huggingface-pytorch-training-2021-05-28-00-12-07-095/debug-output/claim.smd
2021-05-28 00:18:47 3424 huggingface-pytorch-training-2021-05-28-00-12-07-095/debug-output/collections/
2021-05-28 00:18:47 97 huggingface-pytorch-training-2021-05-28-00-12-07-095/debug-output/events/
2021-05-28 00:18:47 233 huggingface-pytorch-training-2021-05-28-00-12-07-095/debug-output/
2021-05-28 00:29:24 0 huggingface-pytorch-training-2021-05-28-00-12-07-095/debug-output/training_job_end.ts
2021-05-28 00:29:14 954594573 huggingface-pytorch-training-2021-05-28-00-12-07-095/output/model.tar.gz
2021-05-28 00:29:22 313 huggingface-pytorch-training-2021-05-28-00-12-07-095/output/output.tar.gz
2021-05-28 00:27:17 156 huggingface-pytorch-training-2021-05-28-00-12-07-095/rule-output/ProfilerReport-1622160727/profiler-output/
2021-05-28 00:12:09 1447 huggingface-pytorch-training-2021-05-28-00-12-07-095/source/sourcedir.tar.gz
(session_analyzer_env)
I do not see logs being captured anywhere. How do I run tensorboard / wandb to visualize training metrics?
Hi, Thank you for awesome example notebooks!
I'm reviewing the notebook about question answering.
https://github.com/huggingface/notebooks/blob/master/examples/question_answering.ipynb
In the notebook, you use the postprocess_qa_predictions function to post-process the model's predictions.
(The function is defined in the 25th cell of the notebook.)
The function initializes the valid_answers list (a local variable) once per example, not once per feature.
However, I thought that valid_answers should be initialized for each feature, since a valid answer for one feature's context might not exist in another feature's context.
Is there anything that I misunderstood?
Thank you
Key | Value1 | Value2 |
---|---|---|
Instance count | 1 | 2 |
Instance types | p3.2xlarge | p3dn.24xlarge |
Models | bert-base-uncased | bert-large-uncased-whole-word-masking |
batch_size | 2 | 8 |
distributions | horovod | smddp |
Huggingface - 2.4.1
Transformer - 4.5.0
DLC - 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-tensorflow-training:2.4.1-transformers4.5.0-gpu-py37-cu110-ubuntu18.04
Nodes | Instance Type | bert-base | bert-large |
---|---|---|---|
1 | p3.2xlarge | success | OOM |
2 | p3.2xlarge | success | OOM |
1 | p3dn.24xlarge | success | OOM |
2 | p3dn.24xlarge | success | OOM |
Independent of distributed training strategy, instance-type, instance-count, TF2.4.1 + HF bert-large suffers from OOM
[1,13]<stderr>:2021-05-05 22:35:24.070037: W tensorflow/core/common_runtime/bfc_allocator.cc:433] Allocator (GPU_0_bfc) ran out of memory trying to allocate 32.00MiB (rounded to 33554432) requested by op tf_bert_for_sequence_classification/bert/encoder/layer_._15/attention/self/transpose_3
[1,13]<stderr>:Current allocation summary follows.
[1,13]<stderr>:2021-05-05 22:35:24.071150: W tensorflow/core/common_runtime/bfc_allocator.cc:441] ****************************************************************************************************
[1,13]<stderr>:2021-05-05 22:35:24.071187: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at transpose_op.cc:184 : Resource exhausted: OOM when allocating tensor with shape[16,512,16,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[1,13]<stderr>:Traceback (most recent call last):
[1,13]<stderr>: File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[1,13]<stderr>: "__main__", mod_spec)
[1,13]<stderr>: File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
[1,13]<stderr>: exec(code, run_globals)
[1,13]<stderr>: File "/usr/local/lib/python3.7/site-packages/mpi4py/__main__.py", line 7, in <module>
[1,13]<stderr>: main()
[1,13]<stderr>: File "/usr/local/lib/python3.7/site-packages/mpi4py/run.py", line 196, in main
[1,13]<stderr>: run_command_line(args)
[1,13]<stderr>: File "/usr/local/lib/python3.7/site-packages/mpi4py/run.py", line 47, in run_command_line
[1,13]<stderr>: run_path(sys.argv[0], run_name='__main__')
[1,13]<stderr>: File "/usr/local/lib/python3.7/runpy.py", line 263, in run_path
[1,13]<stderr>: pkg_name=pkg_name, script_name=fname)
[1,13]<stderr>: File "/usr/local/lib/python3.7/runpy.py", line 96, in _run_module_code
[1,13]<stderr>: mod_name, mod_spec, pkg_name, script_name)
[1,13]<stderr>: File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
[1,13]<stderr>: exec(code, run_globals)
[1,13]<stderr>: File "train_bert.py", line 242, in <module>
[1,13]<stderr>: main()
[1,13]<stderr>: File "train_bert.py", line 205, in main
[1,13]<stderr>: verbose=1 if hvd.rank() == 0 else 0,
[1,13]<stderr>: File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 1129, in fit
[1,13]<stderr>: tmp_logs = self.train_function(iterator)
[1,13]<stderr>: File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
[1,13]<stderr>: result = self._call(*args, **kwds)
[1,13]<stderr>: File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
[1,13]<stderr>: return self._stateless_fn(*args, **kwds)
[1,13]<stderr>: File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2943, in __call__
[1,13]<stderr>: filtered_flat_args, captured_inputs=graph_function.captured_inputs) # pylint: disable=protected-access
[1,13]<stderr>: File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1919, in _call_flat
[1,13]<stderr>: ctx, args, cancellation_manager=cancellation_manager))
[1,13]<stderr>: File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 560, in call
[1,13]<stderr>: ctx=ctx)
[1,13]<stderr>: File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
[1,13]<stderr>: inputs, attrs, num_outputs)
[1,13]<stderr>:tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[16,512,16,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[1,13]<stderr>:#011 [[node tf_bert_for_sequence_classification/bert/encoder/layer_._15/attention/self/transpose_3 (defined at /usr/local/lib/python3.7/site-packages/transformers/models/bert/modeling_tf_bert.py:279) ]]
[1,13]<stderr>:Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[1,13]<stderr>: [Op:__inference_train_function_54347]
[1,13]<stderr>:
[1,13]<stderr>:Errors may have originated from an input operation.
[1,13]<stderr>:Input Source operations connected to node tf_bert_for_sequence_classification/bert/encoder/layer_._15/attention/self/transpose_3:
[1,13]<stderr>: tf_bert_for_sequence_classification/bert/encoder/layer_._15/attention/self/MatMul_1 (defined at /usr/local/lib/python3.7/site-packages/transformers/models/bert/modeling_tf_bert.py:278)
[1,13]<stderr>:
[1,13]<stderr>:Function call stack:
[1,13]<stderr>:train_function
[1,13]<stderr>:
Hi,
When I ran deploy_transformer_model_from_hf_hub.ipynb on AWS Sagemaker notebook instance with conda_pytorch_p36 kernel, I got the following error message for "from sagemaker.huggingface import HuggingFaceModel" command:
ImportError: cannot import name 'HuggingFaceModel'
The official website says that I should use "from sagemaker.huggingface.model import HuggingFaceModel" instead of "from sagemaker.huggingface import HuggingFaceModel", which is used in "deploy_transformer_model_from_hf_hub.ipynb". See the following two resources for your reference:
I just want to inform you that I have already updated "sagemaker" package in the Notebook instance by running "pip install sagemaker --upgrade"
After using "from sagemaker.huggingface.model import HuggingFaceModel", I got rid of the error message mentioned above and I was able to run the following code block:
from sagemaker.huggingface import HuggingFaceModel
import sagemaker
role = sagemaker.get_execution_role()
hub = {
    'HF_MODEL_ID': 'distilbert-base-uncased-distilled-squad',
    'HF_TASK': 'question-answering' # NLP task you want to use for predictions
}
huggingface_model = HuggingFaceModel(
    env=hub,
    role=role,
    transformers_version="4.6",
    pytorch_version="1.7",
    py_version="py36",
)
However, I couldn't run the following code block which generated an error message, which I've attached in a doc file.
predictor = huggingface_model.deploy( initial_instance_count=1, instance_type="ml.m5.xlarge" )
I'd appreciate your help.
Best,
Farshad
Hello, I am running this notebook Question Answering on SQUAD using Colab: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/question_answering.ipynb
I got the IndexError in this step, could you please have a look how to fix it? Thanks!
final_predictions = postprocess_qa_predictions(datasets["validation"], validation_features, raw_predictions.predictions)
This is the output:
Post-processing 10570 example predictions split into 10784 features.
9%
1000/10570 [00:02<00:22, 420.59it/s]
IndexError Traceback (most recent call last)
in ()
----> 1 final_predictions = postprocess_qa_predictions(datasets["validation"], validation_features, raw_predictions.predictions)
in postprocess_qa_predictions(examples, features, raw_predictions, n_best_size, max_answer_length)
57 continue
58
---> 59 start_char = offset_mapping[start_index][0]
60 end_char = offset_mapping[end_index][1]
61 valid_answers.append(
IndexError: list index out of range
Hello @yjernite
Thanks for providing a nice tutorial on training an unsupervised retriever here.
I was wondering if you could provide instructions on how to modify this snippet of your code so that it starts from a pre-trained DPR model (DPRContextEncoder) rather than the distilled BERT model. Since the DPR context encoder doesn't have the embeddings, should we just use the encoder on one mini-batch at a time and skip running the embeddings layer on everything at once for doing checkpointing?
Thanks
Hi,
I am trying to replicate the notebook at this path: https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb
But in my run, instead of selecting the model using the Matthews correlation, it's selecting a model based on the epoch running time. For example, look at the log printed in the notebook I ran: Trial 0 finished with value: 1645.5148768624724 and parameters: {'learning_rate': 2.0970346847322057e-05, 'num_train_epochs': 5, 'seed': 35, 'per_device_train_batch_size': 32}. Best is trial 0 with value: 1645.5148768624724.
Interestingly, the tables printed in my notebook have two additional columns: Runtime and Samples Per Second. My guess is that the library code is picking up the last column of the dataframe rather than the one named after the metric. A snapshot of the table is below.
Epoch | Training Loss | Validation Loss | Matthews Correlation | Runtime | Samples Per Second |
---|---|---|---|---|---|
1 | No log | 0.482555 | 0.435778 | 0.631800 | 1650.953000 |
2 | 0.450100 | 0.494479 | 0.488171 | 0.632700 | 1648.565000 |
3 | 0.450100 | 0.574674 | 0.510249 | 0.637900 | 1635.140000 |
4 | 0.218400 | 0.637276 | 0.519209 | 0.643200 | 1621.627000 |
5 | 0.218400 | 0.680201 | 0.520577 | 0.634300 | 1644.360000 |
Can someone please help me with whether I am missing something?
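A quick arithmetic check suggests a different explanation from the one guessed above: the trial value is very close to the sum of the last epoch's Matthews correlation, runtime and samples per second. The sketch below assumes that the default objective drops the loss and the epoch and sums every remaining evaluation metric; the metric key names and both function names are assumptions made for illustration.

```python
# Evaluation metrics from the last row (epoch 5) of the table above,
# using the key names the Trainer is assumed to report.
metrics = {
    "eval_loss": 0.680201,
    "eval_matthews_correlation": 0.520577,
    "eval_runtime": 0.634300,
    "eval_samples_per_second": 1644.360000,
    "epoch": 5,
}


def summed_objective(metrics):
    """Assumed default behaviour: drop loss and epoch, then sum whatever is left."""
    rest = {k: v for k, v in metrics.items() if k not in ("eval_loss", "epoch")}
    return sum(rest.values()) if rest else metrics["eval_loss"]


def matthews_objective(metrics):
    """Custom objective that only looks at the metric of interest."""
    return metrics["eval_matthews_correlation"]


logged_trial_value = 1645.5148768624724
summed = summed_objective(metrics)
print(summed, abs(summed - logged_trial_value))
print(matthews_objective(metrics))
```

If that assumption holds, passing a function like matthews_objective as the compute_objective argument of trainer.hyperparameter_search, together with direction="maximize", should make the search optimise the Matthews correlation alone.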
Hi,
I'm missing the definition of the tf_train_dataset variable here.
Best,
Florian
I extended the training and evaluation process here https://huggingface.co/transformers/custom_datasets.html#fine-tuning-with-native-pytorch-tensorflow to save the fine-tuned model and use it for prediction separately. Here is the code for it.
true_labels, predicted_labels = [], []
model.eval()
for batch in eval_dataloader:
    batch_labels = batch['labels'].numpy()
    true_labels.extend(batch_labels)
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    labels = batch['labels'].to(device)
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    batch_predictions = predictions.to('cpu').numpy()
    predicted_labels.extend(batch_predictions)
model.save_pretrained('imdb_custom_dataset')
from transformers import AutoModel
model = AutoModel.from_pretrained("/content/imdb_custom_dataset")
When I try to predict using the loaded model I encounter this error AttributeError: 'BaseModelOutput' object has no attribute 'logits' . The code used for it is below.
model.eval()
for batch in test_dataloader:
    break
test_sample = {k: v for k, v in batch.items() if k != 'labels'}
outputs_sample = model(**test_sample)
logits_sample = outputs_sample.logits
Error details:
AttributeError Traceback (most recent call last)
<ipython-input-20-a99e37f72baa> in <module>()
4 test_sample = {k: v for k, v in batch.items() if k != 'labels'}
5 outputs_sample = model(**test_sample)
----> 6 logits_sample = outputs_sample.logits
AttributeError: 'BaseModelOutput' object has no attribute 'logits'
Any help on this issue ?
Thank you
I have created a notebook that translates English into 130+ languages with Helsinki-NLP's opus-mt-en-mul, working similarly to Google Translate. Instead of loading a specific model for every language pair, it uses a single multilingual model.
Consider adding this to the examples repo if needed, as it would be easy for beginners exploring multilingual machine translation.
Hi,
I was trying to deploy wav2vec fine-tuned model to AWS sagemaker but it seems that the automatic-speech-recognition task has not been implemented yet.
Any clue how I can perform a prediction to a huggingface wav2vec model? I have successfully deployed the model and created an Endpoint.
Thanks
Hello!
Reopening an issue connected to this thread here:
Thank you HuggingFace team for all you do! This summer I have been working from this notebook when I noticed a gap I will discuss below. PS - this is my first GitHub issue, if you have any feedback.
Relevant Commits
notebooks/sagemaker/01_getting_started_pytorch/sagemaker-notebook.ipynb
notebooks/sagemaker/01_getting_started_pytorch/scripts/train.py
parser.add_argument("--output-data-dir", type=str, default=os.environ["SM_OUTPUT_DATA_DIR"])
parser.add_argument("--model-dir", type=str, default=os.environ["SM_MODEL_DIR"])
with these
parser.add_argument("--output_data_dir", type=str, default=os.environ["SM_OUTPUT_DATA_DIR"])
parser.add_argument("--model_dir", type=str, default=os.environ["SM_MODEL_DIR"])
As more and more users are using Hugging Face along with Amazon SageMaker, we are seeing a need to make sure the example notebooks in this repo work correctly and to catch bugs proactively, before end users start experiencing them. We need an automated mechanism to regression-test these notebooks by periodically executing all of them and reporting any errors. Each error should create a new issue to be resolved.
This issue is to capture discussion around the practicality of implementing automated testing on this repository. Any thoughts would be greatly appreciated!
@philschmid - If you like to discuss more on this topic.
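As a starting point for that discussion, here is a rough sketch of the cheapest possible check, using only the standard library: it reads a notebook's JSON and verifies that every code cell at least compiles. This is far weaker than actually executing the notebooks, and check_notebook_syntax is a hypothetical helper, not an existing tool.

```python
import json
from pathlib import Path


def check_notebook_syntax(path):
    """Return a list of (cell_index, error_message) for code cells that fail to compile."""
    notebook = json.loads(Path(path).read_text(encoding="utf-8"))
    errors = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The notebook format stores source as a list of lines or a single string.
        lines = source if isinstance(source, list) else source.splitlines(keepends=True)
        # Blank out IPython magics and shell escapes, which are not plain Python.
        code = "".join(
            "\n" if line.lstrip().startswith(("!", "%")) else line for line in lines
        )
        try:
            compile(code, f"{path}:cell{index}", "exec")
        except SyntaxError as exc:
            errors.append((index, str(exc)))
    return errors
```

A scheduled job could run this over every .ipynb file and open an issue when the returned list is non-empty. A magic that is the only statement inside an indented block would be reported as a false positive, which is one reason a real setup would execute the notebooks instead.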
Hi @sgugger, I'm a beginner with Hugging Face and I really love your tutorial, which is the best course I've ever seen in AI.
However, I got a little confused by the "Fine-tuning a pretrained model: A full training" part of the tutorial (https://huggingface.co/course/chapter3/4?fw=pt), which mentions:
# Rename the column label to labels (because the model expects the argument to be named labels).
...
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
...
train_dataloader = DataLoader(
tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
...
I don't think we have to manually rename "label" to "labels", since the source code of data_collator.py contains:
class DataCollatorWithPadding:
    ...
    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        ...
        if "label" in batch:
            batch["labels"] = batch["label"]
            del batch["label"]
        if "label_ids" in batch:
            batch["labels"] = batch["label_ids"]
            del batch["label_ids"]
        return batch
where the column "label" has already been renamed to "labels".
I have tested the version WITHOUT the line below:
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
And found that the "label" has been automatically changed to "labels":
tokenized_datasets = tokenized_datasets.remove_columns(['sentence1', 'sentence2','idx'])
# tokenized_datasets = tokenized_datasets.rename_column('label','labels')
tokenized_datasets.set_format('torch')
print(tokenized_datasets['train'].column_names)
output: ['attention_mask', 'input_ids','label', 'token_type_ids']
from torch.utils.data import DataLoader, Dataset
train_dataloader = DataLoader(tokenized_datasets['train'], shuffle=True, batch_size=8, collate_fn=data_collator)
for batch in train_dataloader:
break
{k: v.shape for k, v in batch.items()}
output: {'attention_mask': torch.Size([8, 65]),
'input_ids': torch.Size([8, 65]),
'token_type_ids': torch.Size([8, 65]),
'labels': torch.Size([8])}
That is, "label" has been automatically changed to "labels" by the data_collator.
Hi all & thanks for the examples!
I see that in SageMaker notebook 6 some metrics are set up as follows:
metric_definitions=[
{'Name': 'loss', 'Regex': "'loss': ([0-9]+(.|e\-)[0-9]+),?"},
{'Name': 'learning_rate', 'Regex': "'learning_rate': ([0-9]+(.|e\-)[0-9]+),?"},
{'Name': 'eval_loss', 'Regex': "'eval_loss': ([0-9]+(.|e\-)[0-9]+),?"},
{'Name': 'eval_accuracy', 'Regex': "'eval_accuracy': ([0-9]+(.|e\-)[0-9]+),?"},
{'Name': 'eval_f1', 'Regex': "'eval_f1': ([0-9]+(.|e\-)[0-9]+),?"},
{'Name': 'eval_precision', 'Regex': "'eval_precision': ([0-9]+(.|e\-)[0-9]+),?"},
{'Name': 'eval_recall', 'Regex': "'eval_recall': ([0-9]+(.|e\-)[0-9]+),?"},
{'Name': 'eval_runtime', 'Regex': "'eval_runtime': ([0-9]+(.|e\-)[0-9]+),?"},
{'Name': 'eval_samples_per_second', 'Regex': "'eval_samples_per_second': ([0-9]+(.|e\-)[0-9]+),?"},
{'Name': 'epoch', 'Regex': "'epoch': ([0-9]+(.|e\-)[0-9]+),?"}]
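I had an issue today with the way learning_rate was being picked up (the exponent was missing) and realised that the regexes used here probably aren't doing what's intended. The group (.|e\-)[0-9]+ accepts either a decimal part like .123 or an exponent like e-123, not both (e.g. in a learning rate like 1.03e-6). An integer like 42 is not matched at all. The . is also unescaped; escaping it calls for a raw string, e.g. r"'loss': ...\...".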
If careful validation is the aim, maybe it could be something more like the following?
r"'my_metric': (-?[0-9]+(\.[0-9]+)?(e[-+]?[0-9]+)?),?"
...Or perhaps we could be a little more concise and trusting on the numbers, with something like the below?
r"'my_metric': ([-+0-9e.]+)[,}]"
As long as the expression is able to articulate what comes immediately after the number (always comma or close brace as far as I can tell?), we could even be super lazy and use e.g. 'my_metric': (.*?)[,}].
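The difference between the existing pattern and the stricter proposal can be checked with Python's re module. This sketch assumes the training log prints metrics as a Python dict; the log line is invented for illustration and the helper names are hypothetical.

```python
import re


def original_regex(name):
    """Pattern in the style currently used in the notebook's metric_definitions."""
    return rf"'{name}': ([0-9]+(.|e\-)[0-9]+),?"


def proposed_regex(name):
    """Stricter pattern from the suggestion: optional fraction, optional exponent."""
    return rf"'{name}': (-?[0-9]+(\.[0-9]+)?(e[-+]?[0-9]+)?),?"


def extract(pattern, text):
    """Return the first captured group, or None when the pattern does not match."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


# Invented log line in the shape the Trainer is assumed to print.
log_line = "{'loss': 0.4501, 'learning_rate': 1.03e-06, 'epoch': 42}"

for name in ("loss", "learning_rate", "epoch"):
    print(name, extract(original_regex(name), log_line), extract(proposed_regex(name), log_line))
```

On this line the original pattern drops the exponent of the learning rate and finds nothing for the integer epoch, while the proposed pattern captures all three values in full.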
Hello,
I can not open the notebook language_modeling.ipynb.
Instead, the message "An error occurred" is displayed.
cc @sgugger
Pierre
Hi,
I tried to follow this tutorial
https://github.com/huggingface/notebooks/blob/master/examples/onnx-export.ipynb
but I just get the error below. What should I do to solve this problem?
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-23-06a1d0b2a7b7> in <module>()
7 # opt_options.enable_embed_layer_norm = False
8
----> 9 optimized_model = optimizer.optimize_model("onnx/bert-base-cased.onnx", model_type='bert', num_heads=12, hidden_size=768)
10 optimized_model.save_model_to_file('bert.opt.onnx')
8 frames
/usr/local/lib/python3.7/dist-packages/onnxruntime_tools/transformers/optimizer.py in optimize_model(input, model_type, num_heads, hidden_size, optimization_options, opt_level, use_gpu, only_onnxruntime)
310
311 if not only_onnxruntime:
--> 312 optimizer.optimize(optimization_options)
313
314 # Remove the temporary model.
/usr/local/lib/python3.7/dist-packages/onnxruntime_tools/transformers/onnx_model_bert.py in optimize(self, options, add_dynamic_axes)
277
278 if (options is None) or options.enable_skip_layer_norm:
--> 279 self.fuse_skip_layer_norm()
280
281 if (options is None) or options.enable_attention:
/usr/local/lib/python3.7/dist-packages/onnxruntime_tools/transformers/onnx_model_bert.py in fuse_skip_layer_norm(self)
103
104 def fuse_skip_layer_norm(self):
--> 105 fusion = FusionSkipLayerNormalization(self)
106 fusion.apply()
107
/usr/local/lib/python3.7/dist-packages/onnxruntime_tools/transformers/fusion_skiplayernorm.py in __init__(self, model)
19 def __init__(self, model: OnnxModel):
20 super().__init__(model, "SkipLayerNormalization", "LayerNormalization")
---> 21 self.shape_infer_helper = self.model.infer_runtime_shape({"batch_size": 4, "seq_len": 7})
22
23 def fuse(self, node, input_name_to_nodes, output_name_to_node):
/usr/local/lib/python3.7/dist-packages/onnxruntime_tools/transformers/onnx_model.py in infer_runtime_shape(self, dynamic_axis_mapping, update)
34 shape_infer_helper = self.shape_infer_helper
35
---> 36 if shape_infer_helper.infer(dynamic_axis_mapping):
37 return shape_infer_helper
38 return None
/usr/local/lib/python3.7/dist-packages/onnxruntime_tools/transformers/shape_infer_helper.py in infer(self, dynamic_axis_mapping)
33 self._preprocess(self.model_)
34 while self.run_:
---> 35 self.all_shapes_inferred_ = self._infer_impl()
36
37 self.inferred_ = True
/usr/local/lib/python3.7/dist-packages/onnxruntime_tools/transformers/../symbolic_shape_infer.py in _infer_impl(self, start_sympy_data)
1301 in_dims = [s[len(s) - out_rank + d] for s in in_shapes if len(s) + d >= out_rank]
1302 if len(in_dims) > 1:
-> 1303 self._check_merged_dims(in_dims, allow_broadcast=True)
1304
1305 for i_o in range(len(node.output)):
/usr/local/lib/python3.7/dist-packages/onnxruntime_tools/transformers/../symbolic_shape_infer.py in _check_merged_dims(self, dims, allow_broadcast)
527 dims = [d for d in dims if not (is_literal(d) and int(d) <= 1)]
528 if not all([d == dims[0] for d in dims]):
--> 529 self._add_suggested_merge(dims, apply=True)
530
531 def _compute_matmul_shape(self, node, output_dtype=None):
/usr/local/lib/python3.7/dist-packages/onnxruntime_tools/transformers/../symbolic_shape_infer.py in _add_suggested_merge(self, symbols, apply)
156
157 def _add_suggested_merge(self, symbols, apply=False):
--> 158 assert all([(type(s) == str and s in self.symbolic_dims_) or is_literal(s) for s in symbols])
159 symbols = set(symbols)
160 for k, v in self.suggested_merge_.items():
AssertionError:
Hi, firstly, thank you for this very useful resource!
I was adapting the text classification example for my own data, but was having trouble figuring out how to:
Could you direct me to some resources to find out more about these? (Including these in the notebook might be useful too to newbies like myself.)
Hi
The following is about the description part, not the code in sample notebooks.
Please fix when updating
local-gpu → local_gpu
I tried to fine-tune mT5 for English->Myanmar translation on the Tatoeba-Challenge dataset, following this notebook example of en-ro translation, with model_checkpoint set to "google/mt5-small". I tested training for 1 to 4 epochs.
The training parameters are below; I reduced batch_size to 4.
batch_size = 4
args = Seq2SeqTrainingArguments(
    "mt5-translate-en-my",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=True,
)
I got "NaN" for both the training loss and the validation loss, as below:
Could you please help me fix this? Thanks in advance.
I'm using the zero-shot pipeline with the valhalla/distilbart-mnli-12-9 model. How do I enable multi-class classification? When using transformers with PyTorch in Python, I pass the argument multi_class=True, but I can't find the appropriate way to do this in SageMaker. See code below:
from sagemaker.huggingface.model import HuggingFaceModel
# Hub Model configuration. <https://huggingface.co/models>
model = 'valhalla/distilbart-mnli-12-9'
hub = {
    'HF_MODEL_ID': model, # model_id from hf.co/models
    'HF_TASK': 'zero-shot-classification' # NLP task you want to use for predictions
}
# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    env=hub, # configuration for loading model from Hub
    role=role, # iam role with permissions to create an Endpoint
    transformers_version="4.6", # transformers version used
    pytorch_version="1.7", # pytorch version used
    py_version="py36"
)
# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.p2.xlarge",
    multi_label= True
)
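One possible direction, sketched under assumptions: deploy() configures the endpoint's infrastructure, so pipeline options more plausibly belong in the request sent to the endpoint. The sketch assumes the inference container forwards a "parameters" dictionary to the pipeline call. The helper build_zero_shot_payload is hypothetical, and the flag name (multi_class in the description above, multi_label in the code) depends on the installed transformers version.

```python
import json


def build_zero_shot_payload(text, candidate_labels, flag_name="multi_label"):
    """Build a request body with the pipeline options under 'parameters' (assumed layout)."""
    return {
        "inputs": text,
        "parameters": {
            "candidate_labels": list(candidate_labels),
            flag_name: True,  # allow several labels to be true at once
        },
    }


payload = build_zero_shot_payload(
    "The match went to extra time before the final vote was postponed.",
    ["sports", "politics", "cooking"],
)
print(json.dumps(payload, indent=2))
```

With an endpoint deployed without the extra keyword, the payload would then be sent as predictor.predict(payload).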
I'm working through the notebook noted in the title, in Sagemaker Studio.
Under the heading "Fine-tuning & starting Sagemaker Training Job", the block of code throws an error about not being able to find HuggingFace. The following import resolves the error:
from sagemaker.huggingface import HuggingFace
minor issue, just letting you know.
Running the chapter 7 notebooks on a GPU Colab instance throws the following error during training:
RuntimeError: Failed to import transformers.training_args because of the following error (look up to see its traceback):
/usr/local/lib/python3.7/dist-packages/_XLAC.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN2at13_foreach_erf_EN3c108ArrayRefINS_6TensorEEE
Could this be related to the installation of Pytorch from https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl introduced in 2dedfdf?
Note that the notebooks are installing transformers==4.12.5 and torch @ https://download.pytorch.org/whl/cu111/torch-1.10.0%2Bcu111-cp37-cp37m-linux_x86_64.whl
In the summarization notebook, where/when do we set the device? Are parallel gpus expected? Two things that could help: i) specify where we could set the device and call model.to(device) and ii) explicate where the model might expect data in parallel e.g. how setting batched=True in the pre-processing or how DataCollatorForSeq2Seq expects tensors.
I have some notebooks illustrating the use of transformers, may I make a PR?
Longform QA notebook uses the wiki40b dataset, which is huge to download and work with. I couldn't get other alternative datasets to work with it as the model expects the format of wiki40b.
What should be the correct approach to get any text corpus to work with this notebook?
In the Huggingface Sagemaker-sdk - Getting Started Demo, when I run the load dataset cell I get the error pasted below. I am running the notebook in SageMaker Studio, I have tried both the Data Science and PyTorch 1.6 kernels.
NonMatchingSplitsSizesError Traceback (most recent call last)
in
1 # load dataset
----> 2 dataset = load_dataset(dataset_name)
3
4 # download tokenizer
5 tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
/opt/conda/lib/python3.6/site-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, script_version, use_auth_token, **config_kwargs)
749 try_from_hf_gcs=try_from_hf_gcs,
750 base_path=base_path,
--> 751 use_auth_token=use_auth_token,
752 )
753
/opt/conda/lib/python3.6/site-packages/datasets/builder.py in download_and_prepare(self, download_config, download_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, **download_and_prepare_kwargs)
573 if not downloaded_from_gcs:
574 self._download_and_prepare(
--> 575 dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
576 )
577 # Sync info
/opt/conda/lib/python3.6/site-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verify_infos, **prepare_split_kwargs)
660
661 if verify_infos:
--> 662 verify_splits(self.info.splits, split_dict)
663
664 # Update the info object with the splits.
/opt/conda/lib/python3.6/site-packages/datasets/utils/info_utils.py in verify_splits(expected_splits, recorded_splits)
72 ]
73 if len(bad_splits) > 0:
---> 74 raise NonMatchingSplitsSizesError(str(bad_splits))
75 logger.info("All the splits matched successfully.")
76
NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='test', num_bytes=32660064, num_examples=25000, dataset_name='imdb'), 'recorded': SplitInfo(name='test', num_bytes=9982987, num_examples=7726, dataset_name='imdb')}, {'expected': SplitInfo(name='train', num_bytes=33442202, num_examples=25000, dataset_name='imdb'), 'recorded': SplitInfo(name='train', num_bytes=0, num_examples=0, dataset_name='imdb')}, {'expected': SplitInfo(name='unsupervised', num_bytes=67125548, num_examples=50000, dataset_name='imdb'), 'recorded': SplitInfo(name='unsupervised', num_bytes=0, num_examples=0, dataset_name='imdb')}]
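The traceback above ends in verify_splits, which compares the split sizes recorded after the download against the sizes the dataset expects. Below is a minimal sketch of that kind of comparison, fed with the numbers from the error message. The helper name find_bad_splits and the tuple layout are made up for illustration; this is not the datasets library's code.

```python
# Illustrative only: a simplified stand-in for the split-size check,
# using the numbers reported in the error message above.

# (num_bytes, num_examples) per split, as printed in the error.
expected = {
    "test": (32660064, 25000),
    "train": (33442202, 25000),
    "unsupervised": (67125548, 50000),
}
recorded = {
    "test": (9982987, 7726),
    "train": (0, 0),
    "unsupervised": (0, 0),
}


def find_bad_splits(expected, recorded):
    """Return the names of splits whose recorded size differs from the expected one."""
    return sorted(
        name
        for name, size in expected.items()
        if recorded.get(name) != size
    )


bad = find_bad_splits(expected, recorded)
print(bad)
```

All three splits come back short here (the test split stops at 7,726 of 25,000 examples and the other two are empty). That would be consistent with an interrupted or partial download rather than a problem in the notebook code, although the log alone does not prove it.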
Hi,
I'm trying to follow the tutorial on text classification. However, when I call load_metric(), it raises the following error:
AttributeError Traceback (most recent call last)
in
1 actual_task = "mnli" if task == "mnli-mm" else task
----> 2 metric = load_metric('glue', actual_task)
3 metric
~/.local/lib/python3.6/site-packages/datasets/load.py in load_metric(path, config_name, process_id, num_process, cache_dir, experiment_id, keep_in_memory, download_config, download_mode, script_version, **metric_init_kwargs)
498 dataset=False,
499 )
--> 500 metric_cls = import_main_class(module_path, dataset=False)
501 metric = metric_cls(
502 config_name=config_name,
~/.local/lib/python3.6/site-packages/datasets/load.py in import_main_class(module_path, dataset)
64 """
65 importlib.invalidate_caches()
---> 66 module = importlib.import_module(module_path)
67
68 if dataset:
/usr/lib/python3.6/importlib/init.py in import_module(name, package)
124 break
125 level += 1
--> 126 return _bootstrap._gcd_import(name[level:], package, level)
127
128
/usr/lib/python3.6/importlib/_bootstrap.py in _gcd_import(name, package, level)
/usr/lib/python3.6/importlib/_bootstrap.py in find_and_load(name, import)
/usr/lib/python3.6/importlib/_bootstrap.py in find_and_load_unlocked(name, import)
/usr/lib/python3.6/importlib/_bootstrap.py in _load_unlocked(spec)
/usr/lib/python3.6/importlib/_bootstrap_external.py in exec_module(self, module)
/usr/lib/python3.6/importlib/_bootstrap.py in _call_with_frames_removed(f, *args, **kwds)
~/.cache/huggingface/modules/datasets_modules/metrics/glue/e4606ab9804a36bcd5a9cebb2cb65bb14b6ac78ee9e6d5981fa679a495dd55de/glue.py in
103
104
--> 105 @datasets.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
106 class Glue(datasets.Metric):
107 def _info(self):
AttributeError: module 'datasets.utils.file_utils' has no attribute 'add_start_docstrings'
I was able to reproduce the fine-tuning process successfully a month ago but got the error above today. The code is exactly the same as in the notebook. Any idea what might have gone wrong? Thanks a lot!
You can add an "Open in Colab" option to every notebook by adding a cell with this code:
`<td>
<a target="_blank" href="https://colab.research.google.com/PUT GITHUB URL HERE">
<img src="https://www.tensorflow.org/images/colab_logo_32px.png" />
Run in Google Colab</a>
</td>`
The image is the Colab logo; the link also works without the image:
`<td>
<a target="_blank" href="https://colab.research.google.com/PUT GITHUB URL HERE">
Run in Google Colab</a>
</td>`
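The PUT GITHUB URL HERE placeholder expects the notebook's GitHub location in Colab's github/<owner>/<repo>/blob/<branch>/<path> form. Here is a small sketch that derives it from a regular GitHub URL. The helper names colab_url and colab_badge are hypothetical, and the example URL is the question_answering notebook referenced elsewhere in these issues.

```python
from urllib.parse import urlparse

COLAB_PREFIX = "https://colab.research.google.com/github"


def colab_url(github_url):
    """Turn https://github.com/<owner>/<repo>/blob/<branch>/<path> into a Colab link."""
    parsed = urlparse(github_url)
    if parsed.netloc != "github.com" or "/blob/" not in parsed.path:
        raise ValueError(f"not a GitHub notebook URL: {github_url}")
    return COLAB_PREFIX + parsed.path


def colab_badge(github_url):
    """Build the badge cell for one notebook, with the logo image included."""
    return (
        "<td>\n"
        f'<a target="_blank" href="{colab_url(github_url)}">\n'
        '<img src="https://www.tensorflow.org/images/colab_logo_32px.png" />\n'
        "Run in Google Colab</a>\n"
        "</td>"
    )


url = "https://github.com/huggingface/notebooks/blob/master/examples/question_answering.ipynb"
print(colab_badge(url))
```

Generating the cell this way keeps the link in step with the notebook's path, which may be handy if the badge is added to many notebooks at once.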
I'm trying to run the '01_getting_started_pytorch' tutorial from a SageMaker instance, without success. I tried the original notebook, then adapted code from here, but it still fails (notebook below). Does anyone have any hints on what I'm doing wrong?
Hi there,
I got the following error when running trainer.hyperparameter_search() on Databricks:
RuntimeError: CUDA out of memory. Tried to allocate 300.00 MiB (GPU 0; 11.17 GiB total capacity; 10.18 GiB already allocated; 274.44 MiB free; 10.50 GiB reserved in total by PyTorch)
My dataset is very small (60 sentences in total). The number of training epochs is 20, and the training and eval batch sizes are 8 and 2 respectively.
I got this message after trial 5. Any idea how to solve this problem?
Hello!
Great code and notebook, and I really love ELI5.
I just want to flag that, if I'm not mistaken, the model was trained on 2 * n_results results without minimum-length filtering, instead of n_results. This is likely not a big deal.
It seems we use the document generated in https://github.com/huggingface/notebooks/blob/master/longform-qa/lfqa_utils.py#L595 in the training set, specifically support_doc (https://github.com/huggingface/notebooks/blob/master/longform-qa/lfqa_utils.py#L601).
However, we never limit the number of results in support_doc to n_results (it presumably holds 2 * n_results, from the index.search call), and we never exclude the results that are shorter than the minimum length.
We do so for res_list, but that is not what gets used at training time: from the notebook, we can see that src_ls gets ignored.
eli5_train_docs = json.load(open('precomputed/eli5_train_precomputed_dense_docs.json'))
eli5_valid_docs = json.load(open('precomputed/eli5_valid_precomputed_dense_docs.json'))
s2s_train_dset = ELI5DatasetS2S(eli5['train_eli5'], document_cache=dict([(k, d) for k, d, src_ls in eli5_train_docs]))
s2s_valid_dset = ELI5DatasetS2S(eli5['validation_eli5'], document_cache=dict([(k, d) for k, d, src_ls in eli5_valid_docs]), training=False)
And later
question_doc = "question: {} context: {}".format(question, doc)
Anyway, I just wanted to let you know.
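To make the difference concrete, here is a minimal sketch of the two ways a support document could be assembled. The function names, the <P> separator, the word-count length measure, and the toy passages are all assumptions for illustration; this is not the code in lfqa_utils.py.

```python
def build_support_doc_unfiltered(passages):
    """Join every retrieved passage, as the issue suggests happens for support_doc."""
    return "<P> " + " <P> ".join(p["passage_text"] for p in passages)


def build_support_doc_filtered(passages, n_results, min_length):
    """Drop passages shorter than min_length words, then keep the first n_results."""
    kept = [p for p in passages if len(p["passage_text"].split()) >= min_length]
    return "<P> " + " <P> ".join(p["passage_text"] for p in kept[:n_results])


n_results = 2
# Stand-in for the 2 * n_results passages returned by the index search.
retrieved = [
    {"passage_text": "short one"},
    {"passage_text": "a passage that is long enough to keep"},
    {"passage_text": "another passage that clears the length bar"},
    {"passage_text": "yet another long passage that would be cut off"},
]

unfiltered = build_support_doc_unfiltered(retrieved)
filtered = build_support_doc_filtered(retrieved, n_results, min_length=5)

print(unfiltered.count("<P>"))  # number of passages in the unfiltered document
print(filtered.count("<P>"))    # number of passages after filtering and truncation
```

In this toy run the unfiltered document carries all four passages, including the two-word one, while the filtered variant keeps two. Whether the extra context helps or hurts training is a separate question that the sketch does not answer.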
On a 2-node p3dn.24xlarge setup, I hit an OOM error while trying to load the pre-trained bert-large-uncased-whole-word-masking model.
Script
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained('bert-large-uncased-whole-word-masking')
SM Launcher
import sagemaker
from sagemaker.huggingface import HuggingFace
role = sagemaker.get_execution_role()
distribution={'mpi': {'enabled':True,"custom_mpi_options":"-verbose --NCCL_DEBUG=INFO -x RDMAV_FORK_SAFE=1"}}
# instance configurations
instance_type='ml.p3dn.24xlarge'
instance_count=2
huggingface_estimator = HuggingFace(
    entry_point='model_load_hf_bert.py',
    source_dir='.',
    instance_type=instance_type,
    role=role,
    instance_count=instance_count,
    transformers_version='4.5.0',
    tensorflow_version='2.4.1',
    py_version='py37',
    distribution=distribution,
    debugger_hook_config=False,  # currently needed
)
huggingface_estimator.fit()
Nodes | Instance Type | Result |
---|---|---|
1 | p3.2xlarge | success |
2 | p3.2xlarge | success |
1 | p3dn.24xlarge | OOM |
2 | p3dn.24xlarge | OOM |
This issue is observed
Specifically the line where it fails
For a detailed Stack Trace
[1,14]<stderr>:2021-05-04 22:39:11.526981: F ./tensorflow/core/kernels/random_op_gpu.h:232] Non-OK-status: GpuLaunchKernel(FillPhiloxRandomKernelLaunch<Distribution>, num_blocks, block_size, 0, d.stream(), gen, data, size, dist) status: Internal: out of memory
[1,15]<stderr>:2021-05-04 22:39:21.984104: W tensorflow/core/common_runtime/bfc_allocator.cc:431] Allocator (GPU_0_bfc) ran out of memory trying to allocate 119.23MiB (rounded to 125018112)requested by op TruncatedNormal
[1,15]<stderr>:Current allocation summary follows.
[1,15]<stderr>:2021-05-04 22:39:21.984204: W tensorflow/core/common_runtime/bfc_allocator.cc:439] *___________________________________________________________________________________________________
[1,15]<stderr>:2021-05-04 22:39:21.985568: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at random_op.cc:77 : Resource exhausted: OOM when allocating tensor with shape[30522,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[1,15]<stderr>:Traceback (most recent call last):
[1,15]<stderr>: File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[1,15]<stderr>: "__main__", mod_spec)
[1,15]<stderr>: File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
[1,15]<stderr>: exec(code, run_globals)
[1,15]<stderr>: File "/usr/local/lib/python3.7/site-packages/mpi4py/__main__.py", line 7, in <module>
[1,15]<stderr>: main()
[1,15]<stderr>: File "/usr/local/lib/python3.7/site-packages/mpi4py/run.py", line 196, in main
[1,15]<stderr>: run_command_line(args)
[1,15]<stderr>: File "/usr/local/lib/python3.7/site-packages/mpi4py/run.py", line 47, in run_command_line
[1,15]<stderr>: run_path(sys.argv[0], run_name='__main__')
[1,15]<stderr>: File "/usr/local/lib/python3.7/runpy.py", line 263, in run_path
[1,15]<stderr>: pkg_name=pkg_name, script_name=fname)
[1,15]<stderr>: File "/usr/local/lib/python3.7/runpy.py", line 96, in _run_module_code
[1,15]<stderr>: mod_name, mod_spec, pkg_name, script_name)
[1,15]<stderr>: File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
[1,15]<stderr>: exec(code, run_globals)
[1,15]<stderr>: File "hf_bert_public.py", line 125, in <module>
[1,15]<stderr>: model = TFAutoModelForSequenceClassification.from_pretrained(args.model_name)
[1,15]<stderr>: File "/usr/local/lib/python3.7/site-packages/transformers/models/auto/auto_factory.py", line 360, in from_pretrained
[1,15]<stderr>: pretrained_model_name_or_path, *model_args, config=config, **kwargs
[1,15]<stderr>: File "/usr/local/lib/python3.7/site-packages/transformers/modeling_tf_utils.py", line 1271, in from_pretrained
[1,15]<stderr>: model(model.dummy_inputs) # build the network with dummy inputs
[1,15]<stderr>: File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 999, in __call__
[1,15]<stderr>: outputs = call_fn(inputs, *args, **kwargs)
[1,15]<stderr>: File "/usr/local/lib/python3.7/site-packages/transformers/models/bert/modeling_tf_bert.py", line 1450, in call
[1,15]<stderr>: training=inputs["training"],
[1,15]<stderr>: File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 999, in __call__
[1,15]<stderr>: outputs = call_fn(inputs, *args, **kwargs)
[1,15]<stderr>: File "/usr/local/lib/python3.7/site-packages/transformers/models/bert/modeling_tf_bert.py", line 650, in call
[1,15]<stderr>: training=inputs["training"],
[1,15]<stderr>: File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 996, in __call__
[1,15]<stderr>: self._maybe_build(inputs)
[1,15]<stderr>: File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 2688, in _maybe_build
[1,15]<stderr>: self.build(input_shapes) # pylint:disable=not-callable
[1,15]<stderr>: File "/usr/local/lib/python3.7/site-packages/transformers/models/bert/modeling_tf_bert.py", line 152, in build
[1,15]<stderr>: initializer=get_initializer(self.initializer_range),
[1,15]<stderr>: File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 616, in add_weight
[1,15]<stderr>: caching_device=caching_device)
[1,15]<stderr>: File "/usr/local/lib/python3.7/site-packages/tensorflow/python/training/tracking/base.py", line 750, in _add_variable_with_custom_getter
[1,15]<stderr>: **kwargs_for_getter)
[1,15]<stderr>: File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer_utils.py", line 145, in make_variable
[1,15]<stderr>: shape=variable_shape if variable_shape else None)
[1,15]<stderr>: File "/usr/local/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 260, in __call__
[1,15]<stderr>: return cls._variable_v1_call(*args, **kwargs)
[1,15]<stderr>: File "/usr/local/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 221, in _variable_v1_call
[1,15]<stderr>: shape=shape)
[1,15]<stderr>: File "/usr/local/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 199, in <lambda>
[1,15]<stderr>: previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
[1,15]<stderr>: File "/usr/local/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 2597, in default_variable_creator
[1,15]<stderr>: shape=shape)
[1,15]<stderr>: File "/usr/local/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 264, in __call__
[1,15]<stderr>: return super(VariableMetaclass, cls).__call__(*args, **kwargs)
[1,15]<stderr>: File "/usr/local/lib/python3.7/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 1518, in __init__
[1,15]<stderr>: distribute_strategy=distribute_strategy)
[1,15]<stderr>: File "/usr/local/lib/python3.7/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 1651, in _init_from_args
[1,15]<stderr>: initial_value() if init_from_fn else initial_value,
[1,15]<stderr>: File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/initializers/initializers_v2.py", line 342, in __call__
[1,15]<stderr>: return super(TruncatedNormal, self).__call__(shape, dtype=_get_dtype(dtype))
[1,15]<stderr>: File "/usr/local/lib/python3.7/site-packages/tensorflow/python/ops/init_ops_v2.py", line 450, in __call__
[1,15]<stderr>: self.stddev, dtype)
[1,15]<stderr>: File "/usr/local/lib/python3.7/site-packages/tensorflow/python/ops/init_ops_v2.py", line 1053, in truncated_normal
[1,15]<stderr>: shape=shape, mean=mean, stddev=stddev, dtype=dtype, seed=self.seed)
[1,15]<stderr>: File "/usr/local/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
[1,15]<stderr>: return target(*args, **kwargs)
[1,15]<stderr>: File "/usr/local/lib/python3.7/site-packages/tensorflow/python/ops/random_ops.py", line 196, in truncated_normal
[1,15]<stderr>: shape_tensor, dtype, seed=seed1, seed2=seed2)
[1,15]<stderr>: File "/usr/local/lib/python3.7/site-packages/tensorflow/python/ops/gen_random_ops.py", line 902, in truncated_normal
[1,15]<stderr>: _ops.raise_from_not_ok_status(e, name)
[1,15]<stderr>: File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 6843, in raise_from_not_ok_status
[1,15]<stderr>: six.raise_from(core._status_to_exception(e.code, message), None)
[1,15]<stderr>: File "<string>", line 3, in raise_from
[1,15]<stderr>:tensorflow.python.framework.errors_impl.ResourceExhaustedError[1,15]<stderr>:: OOM when allocating tensor with shape[30522,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:TruncatedNormal]
When running the transformers pipeline in a Jupyter Notebook, I have had issues a couple of times with the progress bar that shows the download of an NLP model. The error thrown was an ImportError pointing to problems with IProgress and the Jupyter widgets and extensions.
Is there a way to make the code more general? https://github.com/huggingface/notebooks/blob/master/examples/question_answering.ipynb
Is it necessary to have the questions and answers already encoded beforehand?
Can the questions be supplied later instead of being put in beforehand?
In the function postprocess_qa_predictions, the code reads:
if min_null_score is None or min_null_score < feature_null_score:
    min_null_score = feature_null_score
I think < should be replaced by >.
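To see what each comparison keeps, the update can be run over a few made-up null scores. This sketch only demonstrates the behaviour of the two operators; the helper name track_null_score and the scores are hypothetical, and it does not settle which behaviour the notebook intends.

```python
import operator


def track_null_score(feature_null_scores, compare):
    """Apply the update rule quoted above with a given comparison operator."""
    min_null_score = None
    for feature_null_score in feature_null_scores:
        if min_null_score is None or compare(min_null_score, feature_null_score):
            min_null_score = feature_null_score
    return min_null_score


scores = [3.5, 1.2, 7.8]  # made-up null scores for three features of one example

with_less_than = track_null_score(scores, operator.lt)     # current code: <
with_greater_than = track_null_score(scores, operator.gt)  # proposed change: >

print(with_less_than, with_greater_than)
```

With <, the variable ends up holding the largest score seen across the features; with >, it holds the smallest.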
For the notebook question_answering.ipynb:
As you mention in a commented section, for the more general case we will need to match sample_id to an example index.
For the example you provided, did you know which part of the dictionary the answer was in?
I am working with a .txt document and want a general way to find the answer to a question within that document.
When the train.py script is used as is, the model doesn't train successfully on a SageMaker notebook instance when using the built-in conda_tensorflow2_p36 conda environment. This seems to be due to the dataset not being shuffled. The model always outputs LABEL_0 and achieves a test accuracy of 50%.
Adding:
train_dataset = train_dataset.shuffle()
at line 46 seems to solve this issue. Upon retraining, the model functions as expected and achieves a test accuracy of 89.53%.
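One way an unshuffled dataset can produce this symptom is when the examples are stored grouped by label, so that every early batch contains a single class. Here is a toy sketch with plain Python lists. The 50/50 label layout, the batch size, and the seed are assumptions, and random.Random stands in for the dataset's own shuffle.

```python
import random

# Assumed layout: all label-0 examples first, then all label-1 examples.
labels = [0] * 500 + [1] * 500
batch_size = 32

first_batch_sorted = labels[:batch_size]

shuffled = labels[:]
random.Random(42).shuffle(shuffled)  # fixed seed so the run is repeatable
first_batch_shuffled = shuffled[:batch_size]

print(set(first_batch_sorted))    # labels present in the first unshuffled batch
print(set(first_batch_shuffled))  # labels present in the first shuffled batch
```

In the unshuffled case the model sees only one class for the first several hundred examples, which is one plausible route to a classifier that predicts a single label.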
Hi All,
I am trying to replicate the attached code and keep getting the error below. Can you suggest a solution?
Error:- Error for Training job huggingface-pytorch-training-2022-01-25-19-23-38-888: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "raise EnvironmentError(msg)
LOG:-
2022-01-25 19:26:23 Training - Downloading the training image....................bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-01-25 19:29:48,286 sagemaker-training-toolkit INFO Imported framework sagemaker_pytorch_container.training
2022-01-25 19:29:48,307 sagemaker_pytorch_container.training INFO Block until all host DNS lookups succeed.
2022-01-25 19:29:51,328 sagemaker_pytorch_container.training INFO Invoking user training script.
2022-01-25 19:29:51,774 sagemaker-training-toolkit INFO Invoking user script
Training Env:
{
"additional_framework_parameters": {},
"channel_input_dirs": {
"test": "/opt/ml/input/data/test",
"train": "/opt/ml/input/data/train"
},
"current_host": "algo-1",
"framework_module": "sagemaker_pytorch_container.training:main",
"hosts": [
"algo-1"
],
"hyperparameters": {
"hub_token": null,
"model_id": "distilbert-base-uncased",
"eval_batch_size": 20,
"train_batch_size": 10,
"push_to_hub": true,
"hub_model_id": "sagemaker-distilbert-emotion",
"epochs": 1,
"learning_rate": 3e-05,
"hub_strategy": "every_save",
"fp16": true
},
"input_config_dir": "/opt/ml/input/config",
"input_data_config": {
"test": {
"TrainingInputMode": "File",
"S3DistributionType": "FullyReplicated",
"RecordWrapperType": "None"
},
"train": {
"TrainingInputMode": "File",
"S3DistributionType": "FullyReplicated",
"RecordWrapperType": "None"
}
},
"input_dir": "/opt/ml/input",
"is_master": true,
"job_name": "huggingface-pytorch-training-2022-01-25-19-23-38-888",
"log_level": 20,
"master_hostname": "algo-1",
"model_dir": "/opt/ml/model",
"module_dir": "s3://sagemaker-eu-west-2-352316401451/huggingface-pytorch-training-2022-01-25-19-23-38-888/source/sourcedir.tar.gz",
"module_name": "train",
"network_interface_name": "eth0",
"num_cpus": 8,
"num_gpus": 1,
"output_data_dir": "/opt/ml/output/data",
"output_dir": "/opt/ml/output",
"output_intermediate_dir": "/opt/ml/output/intermediate",
"resource_config": {
"current_host": "algo-1",
"hosts": [
"algo-1"
],
"network_interface_name": "eth0"
},
"user_entry_point": "train.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"epochs":1,"eval_batch_size":20,"fp16":true,"hub_model_id":"sagemaker-distilbert-emotion","hub_strategy":"every_save","hub_token":null,"learning_rate":3e-05,"model_id":"distilbert-base-uncased","push_to_hub":true,"train_batch_size":10}
SM_USER_ENTRY_POINT=train.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"test":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["test","train"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=train
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=8
SM_NUM_GPUS=1
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-eu-west-2-352316401451/huggingface-pytorch-training-2022-01-25-19-23-38-888/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"test":"/opt/ml/input/data/test","train":"/opt/ml/input/data/train"},"current_host":"algo-1","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"epochs":1,"eval_batch_size":20,"fp16":true,"hub_model_id":"sagemaker-distilbert-emotion","hub_strategy":"every_save","hub_token":null,"learning_rate":3e-05,"model_id":"distilbert-base-uncased","push_to_hub":true,"train_batch_size":10},"input_config_dir":"/opt/ml/input/config","input_data_config":{"test":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"huggingface-pytorch-training-2022-01-25-19-23-38-888","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-eu-west-2-352316401451/huggingface-pytorch-training-2022-01-25-19-23-38-888/source/sourcedir.tar.gz","module_name":"train","network_interface_name":"eth0","num_cpus":8,"num_gpus":1,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"},"user_entry_point":"train.py"}
SM_USER_ARGS=["--epochs","1","--eval_batch_size","20","--fp16","True","--hub_model_id","sagemaker-distilbert-emotion","--hub_strategy","every_save","--hub_token","","--learning_rate","3e-05","--model_id","distilbert-base-uncased","--push_to_hub","True","--train_batch_size","10"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TEST=/opt/ml/input/data/test
SM_CHANNEL_TRAIN=/opt/ml/input/data/train
SM_HP_HUB_TOKEN=
SM_HP_MODEL_ID=distilbert-base-uncased
SM_HP_EVAL_BATCH_SIZE=20
SM_HP_TRAIN_BATCH_SIZE=10
SM_HP_PUSH_TO_HUB=true
SM_HP_HUB_MODEL_ID=sagemaker-distilbert-emotion
SM_HP_EPOCHS=1
SM_HP_LEARNING_RATE=3e-05
SM_HP_HUB_STRATEGY=every_save
SM_HP_FP16=true
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python38.zip:/opt/conda/lib/python3.8:/opt/conda/lib/python3.8/lib-dynload:/opt/conda/lib/python3.8/site-packages
Invoking script with the following command:
/opt/conda/bin/python3.8 train.py --epochs 1 --eval_batch_size 20 --fp16 True --hub_model_id sagemaker-distilbert-emotion --hub_strategy every_save --hub_token --learning_rate 3e-05 --model_id distilbert-base-uncased --push_to_hub True --train_batch_size 10
2022-01-25 19:30:06 Uploading - Uploading generated training model
2022-01-25 19:30:06 Failed - Training job failed
ProfilerReport-1643138618: Stopping
2022-01-25 19:29:56,293 - main - INFO - loaded train_dataset length is: 16000
2022-01-25 19:29:56,293 - main - INFO - loaded test_dataset length is: 2000
404 Client Error: Not Found for url: https://huggingface.co/None/resolve/main/config.json
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/transformers/configuration_utils.py", line 550, in get_config_dict
404 Client Error: Not Found for url: https://huggingface.co/None/resolve/main/config.json
resolved_config_file = cached_path(
File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 1491, in cached_path
output_path = get_from_cache(
File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 1663, in get_from_cache
r.raise_for_status()
File "/opt/conda/lib/python3.8/site-packages/requests/models.py", line 953, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/None/resolve/main/config.json
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train.py", line 57, in
model = AutoModelForSequenceClassification.from_pretrained(args.model_name)
File "/opt/conda/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 396, in from_pretrained
config, kwargs = AutoConfig.from_pretrained(
File "/opt/conda/lib/python3.8/site-packages/transformers/models/auto/configuration_auto.py", line 558, in from_pretrained
config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/transformers/configuration_utils.py", line 575, in get_config_dict
raise EnvironmentError(msg)
OSError: Can't load config for 'None'. Make sure that:
UnexpectedStatusException Traceback (most recent call last)
in
----> 1 huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
690 self.jobs.append(self.latest_training_job)
691 if wait:
--> 692 self.latest_training_job.wait(logs=logs)
693
694 def _compilation_job_name(self):
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
1665 # If logs are requested, call logs_for_jobs.
1666 if logs != "None":
-> 1667 self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
1668 else:
1669 self.sagemaker_session.wait_for_job(self.job_name)
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
3783
3784 if wait:
-> 3785 self._check_job_status(job_name, description, "TrainingJobStatus")
3786 if dot:
3787 print()
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
3341 ),
3342 allowed_statuses=["Completed", "Stopped"],
-> 3343 actual_status=status,
3344 )
3345
UnexpectedStatusException: Error for Training job huggingface-pytorch-training-2022-01-25-19-23-38-888: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "raise EnvironmentError(msg)
OSError: Can't load config for 'None'. Make sure that: - 'None' is a correct model identifier listed on 'https://huggingface.co/models' (make sure 'None' is not a path to a local directory with something else, in that case) - or 'None' is the correct path to a directory containing a config.json file"
Command "/opt/conda/bin/python3.8 train.py --epochs 1 --eval_batch_size 20 --fp16 True --hub_model_id sagemaker-distilbert-emotion --hub_strategy every_save --hub_token --learning_rate 3e-05 --model_id distilbert-base-uncased --push_to_hub True --train_batch_size 10"
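The log shows the job being launched with --model_id distilbert-base-uncased, while the traceback shows train.py reading args.model_name. If the script declares only --model_name and ignores unknown arguments, that attribute stays None, which would match the 'None' in the error. Below is a minimal sketch of that mismatch. The parser is a guess at what the script does, not its actual source.

```python
import argparse

# Assumed: the script defines --model_name with no default.
parser = argparse.ArgumentParser()
parser.add_argument("--model_name", type=str)
parser.add_argument("--epochs", type=int, default=3)

# Subset of the arguments from the failing command in the log.
argv = ["--epochs", "1", "--model_id", "distilbert-base-uncased"]

# parse_known_args collects unrecognised flags instead of exiting.
args, unknown = parser.parse_known_args(argv)

print(args.model_name)  # never set, because the flag passed was --model_id
print(unknown)          # the arguments the parser did not recognise
```

If this is indeed the cause, passing the hyperparameter under the name the script expects, or updating the script to read model_id, would line the two up.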
I followed the guide from the course and got:
"ProcessExitedException: process 0 terminated with signal SIGSEGV"
To reproduce:
https://github.com/JonathanSum/Hugging-Face-Course/blob/main/Ch3_A_full_training_TPU.ipynb
WARNING: This issue duplicates another issue I opened; apologies if I have opened it in the wrong place.
Hello Hugging Face team (@sgugger, @joeddav, @LysandreJik)
I have a problem with this code base
notebooks/examples/question_answering.ipynb - link
ENV: Google Colab - transformers Version: 4.5.0; datasets Version: 1.5.0; torch Version: 1.8.1+cu101;
I am trying to add some domain tokens in the bert-base-cased tokenizer
model_checkpoint = 'bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
list_of_domain_tokens = ["token1", "token2", "token3"]
tokenizer.add_tokens(list_of_domain_tokens)
...
...
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
print(model.device) # cpu
model.resize_token_embeddings(len(tokenizer))
trainer = Trainer(...)
Then, during the trainer.fit() call, it reports the attached error.
Can you please tell me where I'm going wrong?
The tokenizer output is the usual BERT inputs in the form of List[List[int]], e.g. input_ids and attention_mask.
So I can't figure out where the problem with the device is.
Kind Regards,
Andrea
The TensorFlow model "Helsinki-NLP/opus-mt-en-fr" does not exist and therefore produces an error while executing the notebook.
Steps to Reproduce
from transformers import TFAutoModelForSeq2SeqLM
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
Observed Error
404 Client Error: Not Found for url: https://huggingface.co/Helsinki-NLP/opus-mt-en-fr/resolve/main/tf_model.h5
---------------------------------------------------------------------------
HTTPError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/transformers/modeling_tf_utils.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
1555 use_auth_token=use_auth_token,
-> 1556 user_agent=user_agent,
1557 )
5 frames
HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/Helsinki-NLP/opus-mt-en-fr/resolve/main/tf_model.h5
During handling of the above exception, another exception occurred:
OSError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/transformers/modeling_tf_utils.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
1564 f"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing a file named one of {TF2_WEIGHTS_NAME}, {WEIGHTS_NAME}.\n\n"
1565 )
-> 1566 raise EnvironmentError(msg)
1567 if resolved_archive_file == archive_file:
1568 logger.info(f"loading weights file {archive_file}")
OSError: Can't load weights for 'Helsinki-NLP/opus-mt-en-fr'. Make sure that:
- 'Helsinki-NLP/opus-mt-en-fr' is a correct model identifier listed on 'https://huggingface.co/models'
(make sure 'Helsinki-NLP/opus-mt-en-fr' is not a path to a local directory with something else, in that case)
- or 'Helsinki-NLP/opus-mt-en-fr' is the correct path to a directory containing a file named one of tf_model.h5, pytorch_model.bin.
The TensorFlow model file is missing from the Hugging Face model hub.
I got:
AttributeError: module 'sacrebleu' has no attribute 'DEFAULT_TOKENIZER'
when I tried to run metric = load_metric("sacrebleu") in translation.ipynb.
I think the sacrebleu version should be pinned.
Hello, I am using the language_modeling-tf.ipynb notebook to train a masked language modeling model. After training finishes, I try to save my model with model.save("/content/drive/MyDrive/bert_pre"), but I get this error:
ValueError Traceback (most recent call last)
in ()
----> 1 model.save("/content/drive/MyDrive/bert_pre")
1 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/tracking/data_structures.py in _checkpoint_dependencies(self)
875 "dictionary checkpointed, wrap it in a "
876 "non-trackable object; it will be subsequently ignored." % (
--> 877 self, self, self._self_last_wrapped_dict_snapshot))
878 assert not self._dirty # Any reason for dirtiness should have an exception.
879 return super(_DictWrapper, self)._checkpoint_dependencies
ValueError: Unable to save the object {'loss': <function dummy_loss at 0x7f9e28a0a830>, 'logits': None} (a dictionary wrapper constructed automatically on attribute assignment). The wrapped dictionary was modified outside the wrapper (its final value was {'loss': <function dummy_loss at 0x7f9e28a0a830>, 'logits': None}, its value when a checkpoint dependency was added was None), which breaks restoration on object creation.
If you don't need this dictionary checkpointed, wrap it in a non-trackable object; it will be subsequently ignored.
How can I save my model?