huggingface / notebooks

Notebooks using the Hugging Face libraries 🤗

License: Apache License 2.0

Jupyter Notebook 99.86% Python 0.13% Makefile 0.01% Shell 0.01%

notebooks's Issues

Training script error in 01_getting_started_pytorch/sagemaker-notebook.ipynb

Notebook: 01_getting_started_pytorch/sagemaker-notebook.ipynb

Error:

Invoking script with the following command:

| 2021-04-21T16:46:12.576-07:00 | /opt/conda/bin/python3.6 train.py --epochs 1 --model_name distilbert-base-uncased --train_batch_size 32
| 2021-04-21T16:46:17.578-07:00 | Traceback (most recent call last):
  File "train.py", line 42, in <module>
    train_dataset = load_from_disk(args.training_dir)
  File "/opt/conda/lib/python3.6/site-packages/datasets/load.py", line 781, in load_from_disk
    return Dataset.load_from_disk(dataset_path, fs)
  File "/opt/conda/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 684, in load_from_disk
    state = {k: state[k] for k in dataset.__dict__.keys()}  # in case we add new fields
  File "/opt/conda/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 684, in <dictcomp>
    state = {k: state[k] for k in dataset.__dict__.keys()}  # in case we add new fields
| 2021-04-21T16:46:17.578-07:00 | KeyError: '_data'

The failing line in train.py is:

train_dataset = load_from_disk(args.training_dir)

I tried changing the notebook to pin only "transformers==4.4.2", which matches SageMaker's Hugging Face Docker image (also based on 4.4.2). But it still seems that state.json cannot be loaded from the file system (the arrow file).

ProcessExitedException: process 0 terminated with signal SIGSEGV

Hi,
I encountered a SIGSEGV exception while trying to run a copy of the Simple NLP Example on Colab.

The odd part is that I did not modify the example notebook, apart from a few minor changes.

The only major change is that I started from a PyTorch/XLA sample notebook, in which I had previously run my own examples, and copied the Simple NLP Example code cells into it.

It's strange because the original notebook runs fine, but this copy triggers an error in the notebook launcher...

Possible error in the question_answering notebook ?

In the question answering notebook, the cell after max_answer_length = 30 contains this code:

.......
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
        # to part of the input_ids that are not in the context.
        if (
            start_index >= len(offset_mapping)
            or end_index >= len(offset_mapping)
            or offset_mapping[start_index] is None
            or offset_mapping[end_index] is None
        ):
            continue
........

Here the values in start_indexes can never exceed the sequence length, because the argsort is taken over an array of 384 logits, i.e. the maximum sequence length. So how do the checks start_index >= len(offset_mapping) and end_index >= len(offset_mapping) make sense here?
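
To make the slicing concrete, here is a minimal sketch of what the [-1 : -n_best_size - 1 : -1] slice selects. It uses a plain-Python stand-in for np.argsort so that it runs without NumPy; the helper names, the toy logits, and the deliberately short offset_mapping are assumptions for illustration, not code from the notebook.

```python
def argsort(values):
    """Stdlib stand-in for np.argsort: indices that sort `values` ascending."""
    return sorted(range(len(values)), key=lambda i: values[i])


def top_n_indexes(logits, n_best_size):
    # Same slice as in the notebook: walk the ascending order backwards and
    # keep the n_best_size positions with the highest logits.
    return argsort(logits)[-1 : -n_best_size - 1 : -1]


start_logits = [0.1, 2.5, -1.0, 0.7, 1.9]  # toy values, one per token position
start_indexes = top_n_indexes(start_logits, n_best_size=3)
print(start_indexes)  # [1, 4, 3]

# Each index is a position in the logits array, so the bounds check can only
# trigger when offset_mapping has fewer entries than there are logits.
offset_mapping = [(0, 3), (4, 8), None]  # hypothetical, deliberately too short
in_bounds = [i for i in start_indexes if i < len(offset_mapping)]
print(in_bounds)  # [1]
```

Under these assumptions the guard is purely defensive: it is redundant whenever offset_mapping has exactly one entry per logit, which is the situation the question describes.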

Hyperparameters sent by the client aren't passed to the Training Arguments

Description

The hyperparameters sent by the client have an underscore in them (e.g. train_batch_size), whereas those received by the argparser have a hyphen (e.g. train-batch-size). Therefore, values do not get propagated through the train.py file.

Files

I have tested the solution on these files

notebooks/sagemaker/01_getting_started_pytorch/sagemaker-notebook.ipynb
notebooks/sagemaker/01_getting_started_pytorch/scripts/train.py

but I suspect we'll have to update train.py in the following folders as well - 05_spot_instances, 06_sagemaker_metrics

Solution (based on my observation)

In the train.py file, swap these lines -

parser.add_argument("--train-batch-size", type=int, default=32)
parser.add_argument("--eval-batch-size", type=int, default=64)

parser.add_argument("--output-data-dir", type=str, default=os.environ["SM_OUTPUT_DATA_DIR"])
parser.add_argument("--model-dir", type=str, default=os.environ["SM_MODEL_DIR"])

with these

parser.add_argument("--train_batch_size", type=int, default=32)
parser.add_argument("--eval_batch_size", type=int, default=64)

parser.add_argument("--output_data_dir", type=str, default=os.environ["SM_OUTPUT_DATA_DIR"])
parser.add_argument("--model_dir", type=str, default=os.environ["SM_MODEL_DIR"])
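
The mismatch can be reproduced with argparse alone. The sketch below assumes the training script parses its arguments with parse_known_args (an assumption about train.py, which is not shown here); in that case an unrecognized --train_batch_size flag is silently set aside and the default is used. The helper name parse_batch_size is made up for illustration.

```python
import argparse


def parse_batch_size(flag_name, argv):
    """Declare one option under `flag_name` and parse `argv` leniently."""
    parser = argparse.ArgumentParser()
    # argparse turns both "--train-batch-size" and "--train_batch_size"
    # into the attribute name `train_batch_size`.
    parser.add_argument(flag_name, type=int, default=32)
    # parse_known_args does not fail on unknown flags; it returns them separately.
    args, unknown = parser.parse_known_args(argv)
    return args.train_batch_size, unknown


# What the client sends, per the issue: underscores in the flag name.
client_argv = ["--train_batch_size", "16"]

hyphen_result = parse_batch_size("--train-batch-size", client_argv)
underscore_result = parse_batch_size("--train_batch_size", client_argv)

print(hyphen_result)      # (32, ['--train_batch_size', '16'])
print(underscore_result)  # (16, [])
```

With the hyphenated declaration the client's value lands in the unknown list and the default of 32 wins, which matches the behaviour described above.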

How to visualize training metrics with Sagemaker Huggingface estimator?

Hello,
I successfully ran the Jupyter notebook hosted here, which uses the SageMaker Hugging Face estimator. After training completed (sequence classification on the imdb dataset), I was able to see the following artifacts in S3:

$ aws s3 ls s3://sagemaker-us-east-1-135890****/****  --recursive    
2021-05-28 00:18:46          0 huggingface-pytorch-training-2021-05-28-00-12-07-095/debug-output/claim.smd
2021-05-28 00:18:47       3424 huggingface-pytorch-training-2021-05-28-00-12-07-095/debug-output/collections/
2021-05-28 00:18:47         97 huggingface-pytorch-training-2021-05-28-00-12-07-095/debug-output/events/
2021-05-28 00:18:47        233 huggingface-pytorch-training-2021-05-28-00-12-07-095/debug-output/
2021-05-28 00:29:24          0 huggingface-pytorch-training-2021-05-28-00-12-07-095/debug-output/training_job_end.ts
2021-05-28 00:29:14  954594573 huggingface-pytorch-training-2021-05-28-00-12-07-095/output/model.tar.gz
2021-05-28 00:29:22        313 huggingface-pytorch-training-2021-05-28-00-12-07-095/output/output.tar.gz
2021-05-28 00:27:17        156 huggingface-pytorch-training-2021-05-28-00-12-07-095/rule-output/ProfilerReport-1622160727/profiler-output/
2021-05-28 00:12:09       1447 huggingface-pytorch-training-2021-05-28-00-12-07-095/source/sourcedir.tar.gz
(session_analyzer_env)

I do not see logs being captured anywhere. How do I run tensorboard / wandb to visualize training metrics?

Question on question_answering.ipynb

Hi, thank you for the awesome example notebooks!
I'm reviewing the question answering notebook:
https://github.com/huggingface/notebooks/blob/master/examples/question_answering.ipynb

In the notebook, the postprocess_qa_predictions function post-processes the model's predictions.
(The function is defined in the 25th cell of the notebook.)
The function initializes the local valid_answers list once per example, not once per feature.
However, I thought valid_answers should be initialized for each feature, since a valid answer in one feature's context might not exist in another feature's context.

Is there anything I misunderstood?

Thank you

BERT large OOM with TF 2.4.1 + transformer 4.5.0

Configuration

Parameters, Hyperparameters

Key	Value1	Value2
Instance count	1	2
Instance types	p3.2xlarge	p3dn.24xlarge
Models	bert-base-uncased	bert-large-uncased-whole-word-masking
batch_size	2	8
distributions	horovod	smddp

Versions

TensorFlow - 2.4.1
Transformers - 4.5.0
DLC - 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-tensorflow-training:2.4.1-transformers4.5.0-gpu-py37-cu110-ubuntu18.04

Experiments

Nodes	Instance Type	bert-base	bert-large
1	p3.2xlarge	success	OOM
2	p3.2xlarge	success	OOM
1	p3dn.24xlarge	success	OOM
2	p3dn.24xlarge	success	OOM

Summary

Independent of distributed training strategy, instance type, and instance count, TF 2.4.1 + HF bert-large runs out of memory (OOM).

Entire Stack trace

[1,13]<stderr>:2021-05-05 22:35:24.070037: W tensorflow/core/common_runtime/bfc_allocator.cc:433] Allocator (GPU_0_bfc) ran out of memory trying to allocate 32.00MiB (rounded to 33554432)requested by op tf_bert_for_sequence_classification/bert/encoder/layer_._15/attention/self/transpose_3
[1,13]<stderr>:Current allocation summary follows.
[1,13]<stderr>:2021-05-05 22:35:24.071150: W tensorflow/core/common_runtime/bfc_allocator.cc:441] ****************************************************************************************************
[1,13]<stderr>:2021-05-05 22:35:24.071187: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at transpose_op.cc:184 : Resource exhausted: OOM when allocating tensor with shape[16,512,16,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[1,13]<stderr>:Traceback (most recent call last):
[1,13]<stderr>:  File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[1,13]<stderr>:    "__main__", mod_spec)
[1,13]<stderr>:  File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
[1,13]<stderr>:    exec(code, run_globals)
[1,13]<stderr>:  File "/usr/local/lib/python3.7/site-packages/mpi4py/__main__.py", line 7, in <module>
[1,13]<stderr>:    main()
[1,13]<stderr>:  File "/usr/local/lib/python3.7/site-packages/mpi4py/run.py", line 196, in main
[1,13]<stderr>:    run_command_line(args)
[1,13]<stderr>:  File "/usr/local/lib/python3.7/site-packages/mpi4py/run.py", line 47, in run_command_line
[1,13]<stderr>:    run_path(sys.argv[0], run_name='__main__')
[1,13]<stderr>:  File "/usr/local/lib/python3.7/runpy.py", line 263, in run_path
[1,13]<stderr>:    pkg_name=pkg_name, script_name=fname)
[1,13]<stderr>:  File "/usr/local/lib/python3.7/runpy.py", line 96, in _run_module_code
[1,13]<stderr>:    mod_name, mod_spec, pkg_name, script_name)
[1,13]<stderr>:  File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
[1,13]<stderr>:    exec(code, run_globals)
[1,13]<stderr>:  File "train_bert.py", line 242, in <module>
[1,13]<stderr>:    main()
[1,13]<stderr>:  File "train_bert.py", line 205, in main
[1,13]<stderr>:    verbose=1 if hvd.rank() == 0 else 0,
[1,13]<stderr>:  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 1129, in fit
[1,13]<stderr>:    tmp_logs = self.train_function(iterator)
[1,13]<stderr>:  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
[1,13]<stderr>:    result = self._call(*args, **kwds)
[1,13]<stderr>:  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
[1,13]<stderr>:    return self._stateless_fn(*args, **kwds)
[1,13]<stderr>:  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2943, in __call__
[1,13]<stderr>:    filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
[1,13]<stderr>:  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1919, in _call_flat
[1,13]<stderr>:    ctx, args, cancellation_manager=cancellation_manager))
[1,13]<stderr>:  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 560, in call
[1,13]<stderr>:    ctx=ctx)
[1,13]<stderr>:  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
[1,13]<stderr>:    inputs, attrs, num_outputs)
[1,13]<stderr>:tensorflow.python.framework.errors_impl.ResourceExhaustedError:  OOM when allocating tensor with shape[16,512,16,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[1,13]<stderr>:#011 [[node tf_bert_for_sequence_classification/bert/encoder/layer_._15/attention/self/transpose_3 (defined at /usr/local/lib/python3.7/site-packages/transformers/models/bert/modeling_tf_bert.py:279) ]]
[1,13]<stderr>:Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[1,13]<stderr>: [Op:__inference_train_function_54347]
[1,13]<stderr>:
[1,13]<stderr>:Errors may have originated from an input operation.
[1,13]<stderr>:Input Source operations connected to node tf_bert_for_sequence_classification/bert/encoder/layer_._15/attention/self/transpose_3:
[1,13]<stderr>: tf_bert_for_sequence_classification/bert/encoder/layer_._15/attention/self/MatMul_1 (defined at /usr/local/lib/python3.7/site-packages/transformers/models/bert/modeling_tf_bert.py:278)
[1,13]<stderr>:
[1,13]<stderr>:Function call stack:
[1,13]<stderr>:train_function
[1,13]<stderr>:
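
As a sanity check on the trace, the size of the tensor that failed to allocate can be worked out by hand. The sketch below assumes "type float" means 4-byte float32; reading the shape as (batch, sequence length, heads, head size) would be a further assumption and is not needed for the arithmetic.

```python
# Shape reported in the OOM message: shape[16,512,16,64], type float.
shape = (16, 512, 16, 64)
bytes_per_element = 4  # assumption: "float" is float32

num_elements = 1
for dim in shape:
    num_elements *= dim

size_bytes = num_elements * bytes_per_element
size_mib = size_bytes / 2**20

print(num_elements)  # 8388608
print(size_bytes)    # 33554432
print(size_mib)      # 32.0
```

The result agrees with the allocator message ("32.00MiB (rounded to 33554432)"), so the failure is a single 32 MiB request on a GPU whose memory was already exhausted, not one unusually large tensor.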

I can't run deploy_transformer_model_from_hf_hub.ipynb on AWS Sagemaker notebook instance

Hi,
When I ran deploy_transformer_model_from_hf_hub.ipynb on an AWS SageMaker notebook instance with the conda_pytorch_p36 kernel, the command "from sagemaker.huggingface import HuggingFaceModel" failed with the following error:

ImportError: cannot import name 'HuggingFaceModel'

The official website says to use "from sagemaker.huggingface.model import HuggingFaceModel" instead of "from sagemaker.huggingface import HuggingFaceModel", which is what deploy_transformer_model_from_hf_hub.ipynb uses. See the following two resources for your reference:

Note that I had already updated the "sagemaker" package on the notebook instance by running "pip install sagemaker --upgrade".

After switching to "from sagemaker.huggingface.model import HuggingFaceModel", the error above went away and I was able to run the following code block:

# Import path that worked on this notebook instance (see above)
from sagemaker.huggingface.model import HuggingFaceModel
import sagemaker

role = sagemaker.get_execution_role()

# Hub model configuration
hub = {
    'HF_MODEL_ID': 'distilbert-base-uncased-distilled-squad',
    'HF_TASK': 'question-answering',  # NLP task you want to use for predictions
}

huggingface_model = HuggingFaceModel(
    env=hub,
    role=role,
    transformers_version="4.6",
    pytorch_version="1.7",
    py_version="py36",
)

However, the following code block failed with an error message, which I've attached as a doc file:
predictor = huggingface_model.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")

I'd appreciate your help.

Best,
Farshad

deploy_error_message.docx

Notebook: Question Answering on SQUAD: IndexError: list index out of range

Hello, I am running this notebook Question Answering on SQUAD using Colab: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/question_answering.ipynb

I got an IndexError at this step. Could you please take a look at how to fix it? Thanks!

final_predictions = postprocess_qa_predictions(datasets["validation"], validation_features, raw_predictions.predictions)

This is the output:

Post-processing 10570 example predictions split into 10784 features.
9%
1000/10570 [00:02<00:22, 420.59it/s]

IndexError Traceback (most recent call last)
in ()
----> 1 final_predictions = postprocess_qa_predictions(datasets["validation"], validation_features, raw_predictions.predictions)

in postprocess_qa_predictions(examples, features, raw_predictions, n_best_size, max_answer_length)
57 continue
58
---> 59 start_char = offset_mapping[start_index][0]
60 end_char = offset_mapping[end_index][1]
61 valid_answers.append(

IndexError: list index out of range

Contrastive Training for dense retriever using pretrained DPR

Hello @yjernite

Thanks for providing a nice tutorial on training an unsupervised retriever here.

I was wondering if you could provide instructions on how to modify this snippet of your code so that training starts from a pre-trained DPR model (DPRContextEncoder) rather than the distilled BERT model. Since the DPR context encoder doesn't expose the embeddings, should we just run the encoder on one mini-batch at a time for checkpointing, and skip running the embeddings layer on everything at once?

Thanks

Instead of picking up the metric, hyperparam search is using SamplePerSeconds to select the best model

Hi,
I am trying to replicate the notebook at this path: https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb

But in my run, instead of selecting the model using the Matthews correlation, it's selecting a model based on the epoch running time. For example, look at the log printed in the notebook I ran: Trial 0 finished with value: 1645.5148768624724 and parameters: {'learning_rate': 2.0970346847322057e-05, 'num_train_epochs': 5, 'seed': 35, 'per_device_train_batch_size': 32}. Best is trial 0 with value: 1645.5148768624724.

Interestingly, the tables printed in my notebook have two additional columns: Runtime and Samples Per Second. My guess is that the library code is picking up the last column of the dataframe and not the one with the name of the metric. Adding a snapshot of the table below.

Epoch	Training Loss	Validation Loss	Matthews Correlation	Runtime	Samples Per Second
1	No log	0.482555	0.435778	0.631800	1650.953000
2	0.450100	0.494479	0.488171	0.632700	1648.565000
3	0.450100	0.574674	0.510249	0.637900	1635.140000
4	0.218400	0.637276	0.519209	0.643200	1621.627000
5	0.218400	0.680201	0.520577	0.634300	1644.360000

Can someone please help me figure out whether I am missing something?
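
One observation from the numbers above: adding the last three columns of the epoch 5 row reproduces the reported trial value almost exactly, which suggests the objective was a sum over all non-loss metrics rather than the last column alone. The sketch below checks that arithmetic and shows a hypothetical objective that returns only the Matthews correlation. The metric key names (with the eval_ prefix) and both helper functions are assumptions for illustration; whether the library computes exactly this sum is also an assumption, though it is consistent with the log.

```python
# Epoch 5 row from the table above; the key names are assumed, not copied from a log.
epoch5_metrics = {
    "eval_loss": 0.680201,
    "eval_matthews_correlation": 0.520577,
    "eval_runtime": 0.634300,
    "eval_samples_per_second": 1644.360000,
}


def summed_objective(metrics):
    # Hypothetical reconstruction: drop the loss, add up everything else.
    return sum(value for key, value in metrics.items() if key != "eval_loss")


def matthews_objective(metrics):
    # Objective that looks only at the metric of interest.
    return metrics["eval_matthews_correlation"]


reported_trial_value = 1645.5148768624724
print(summed_objective(epoch5_metrics))
print(abs(summed_objective(epoch5_metrics) - reported_trial_value) < 1e-4)
print(matthews_objective(epoch5_metrics))
```

A function like matthews_objective could be passed as the compute_objective argument of trainer.hyperparameter_search, so that trials are compared on the Matthews correlation only.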

Chapter 7, Section 2 Token classification: tf_train_dataset not defined

Hi,

I'm missing the definition of the tf_train_dataset variable here.

Best,
Florian

prediction after loading the fine tuned model fails: 'BaseModelOutput' object has no attribute 'logits'

I extended the training and evaluation process here https://huggingface.co/transformers/custom_datasets.html#fine-tuning-with-native-pytorch-tensorflow to save the fine-tuned model and use it for prediction separately. Here is the code for it.

true_labels, predicted_labels = [], []
model.eval()

for batch in eval_dataloader:
    
    batch_labels = batch['labels'].numpy()
    true_labels.extend(batch_labels)

    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    labels = batch['labels'].to(device)

    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
    
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    batch_predictions = predictions.to('cpu').numpy()
    predicted_labels.extend(batch_predictions)

model.save_pretrained('imdb_custom_dataset')
from transformers import AutoModel
model = AutoModel.from_pretrained("/content/imdb_custom_dataset")

When I try to predict using the loaded model I encounter this error AttributeError: 'BaseModelOutput' object has no attribute 'logits' . The code used for it is below.

model.eval()
for batch in test_dataloader:
    break
test_sample = {k: v for k, v in batch.items() if k != 'labels'}
outputs_sample = model(**test_sample)
logits_sample = outputs_sample.logits

Error details:

AttributeError                            Traceback (most recent call last)
<ipython-input-20-a99e37f72baa> in <module>()
      4 test_sample = {k: v for k, v in batch.items() if k != 'labels'}
      5 outputs_sample = model(**test_sample)
----> 6 logits_sample = outputs_sample.logits

AttributeError: 'BaseModelOutput' object has no attribute 'logits'

Any help on this issue ?
Thank you
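
A likely cause is the loading step: AutoModel restores only the base encoder, whose output has no logits, whereas AutoModelForSequenceClassification restores the classification head as well. The sketch below illustrates the difference with a tiny, randomly initialised DistilBERT saved to a temporary directory, so nothing is downloaded. The DistilBERT architecture and all config sizes are assumptions chosen for illustration, not details taken from the code above.

```python
import tempfile

import torch
from transformers import (
    AutoModel,
    AutoModelForSequenceClassification,
    DistilBertConfig,
    DistilBertForSequenceClassification,
)

# Tiny random model standing in for the fine-tuned checkpoint (assumed sizes).
config = DistilBertConfig(
    vocab_size=100, dim=32, n_layers=1, n_heads=2, hidden_dim=64, num_labels=2
)
fine_tuned = DistilBertForSequenceClassification(config)

input_ids = torch.tensor([[1, 2, 3, 4]])

with tempfile.TemporaryDirectory() as save_dir:
    fine_tuned.save_pretrained(save_dir)

    # Base encoder only: the classification head weights are ignored.
    base_model = AutoModel.from_pretrained(save_dir).eval()
    # Encoder plus classification head, matching how the model was trained.
    clf_model = AutoModelForSequenceClassification.from_pretrained(save_dir).eval()

    with torch.no_grad():
        base_out = base_model(input_ids)
        clf_out = clf_model(input_ids)

print(type(base_out).__name__, hasattr(base_out, "logits"))
print(type(clf_out).__name__, tuple(clf_out.logits.shape))
```

If that is what happened here, replacing AutoModel with AutoModelForSequenceClassification in the loading code should make outputs_sample.logits available again.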

A simple English-Multilingual Translation notebook

I created a notebook that translates English into 130+ languages with Helsinki-NLP's opus-mt-en-mul model, which works similarly to Google Translate. Instead of loading a specific model for every language pair, it uses a single multilingual model.

Consider adding it to the examples repo, as it would be helpful for beginners exploring multilingual machine translation.

Unknown task automatic-speech-recognition (SageMaker)

Hi,
I was trying to deploy a fine-tuned wav2vec model to AWS SageMaker, but it seems that the automatic-speech-recognition task has not been implemented yet.

Any clue how I can run a prediction against a Hugging Face wav2vec model? I have successfully deployed the model and created an endpoint.

Thanks

01_getting_started_pytorch notebook - Hyperparameters sent by the client aren't passed to the Training Arguments

Hello!

Reopening an issue connected to this thread here:

Thank you HuggingFace team for all you do! This summer I have been working from this notebook when I noticed a gap I will discuss below. PS - this is my first GitHub issue, if you have any feedback.

Description (same as 52)

The hyperparameters sent by the client have an underscore in them (e.g. output_data_dir), whereas those received by the argparser have a hyphen (e.g. output-data-dir). Therefore, values do not get propagated through the train.py file.

Why another issue?

Recent commits have resolved the above-linked issue in most of the notebooks. However, I noticed that the commit fixing this issue for PyTorch missed the second half of the typos: it fixed train-batch-size and eval-batch-size, but output-data-dir and model-dir still need to be fixed.

Relevant Commits

9df51d5 (tensorflow + others)
c3fa5b5 (half the fix for pytorch)

Files

I have tested the solution on these files

notebooks/sagemaker/01_getting_started_pytorch/sagemaker-notebook.ipynb
notebooks/sagemaker/01_getting_started_pytorch/scripts/train.py

Solution

In the train.py file, swap these lines -

parser.add_argument("--output-data-dir", type=str, default=os.environ["SM_OUTPUT_DATA_DIR"])
parser.add_argument("--model-dir", type=str, default=os.environ["SM_MODEL_DIR"])

with these

parser.add_argument("--output_data_dir", type=str, default=os.environ["SM_OUTPUT_DATA_DIR"])
parser.add_argument("--model_dir", type=str, default=os.environ["SM_MODEL_DIR"])

Implement automated notebook testing

As more and more users are using Hugging Face with Amazon SageMaker, we see a need to make sure the example notebooks in this repo work correctly, and to catch bugs proactively before end users run into them. We need an automated mechanism that regression-tests these notebooks by periodically executing all of them and reporting any errors. Each error should create a new issue to be resolved.
This issue is meant to capture discussion around the practicality of implementing automated testing on this repository. Any thoughts would be greatly appreciated!
@philschmid - in case you'd like to discuss this topic further.

【Need Help!】 About handling of the "labels" in the Huggingface Tutorial

Hi @sgugger, I'm a beginner with Hugging Face, and I really love your tutorial; it is the best AI course I've ever seen.

However, I got a little confused by the "Fine-tuning a pretrained model - A full training" part of the tutorial (https://huggingface.co/course/chapter3/4?fw=pt), which says:

# Rename the column label to labels (because the model expects the argument to be named labels).
...
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
...
train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
...

I don't think we have to manually rename "label" to "labels", since the source code of data_collator.py contains:

class DataCollatorWithPadding:
  
    ...

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        ...
        if "label" in batch:
            batch["labels"] = batch["label"]
            del batch["label"]
        if "label_ids" in batch:
            batch["labels"] = batch["label_ids"]
            del batch["label_ids"]
        return batch

where the column "label" has already been renamed to "labels".

I have tested the version WITHOUT the line below:

tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

And I found that "label" was automatically changed to "labels":

tokenized_datasets = tokenized_datasets.remove_columns(['sentence1', 'sentence2','idx'])
# tokenized_datasets = tokenized_datasets.rename_column('label','labels')
tokenized_datasets.set_format('torch')
print(tokenized_datasets['train'].column_names)

output: ['attention_mask', 'input_ids','label', 'token_type_ids']

from torch.utils.data import DataLoader, Dataset
train_dataloader = DataLoader(tokenized_datasets['train'], shuffle=True, batch_size=8, collate_fn=data_collator)
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

output: {'attention_mask': torch.Size([8, 65]),
'input_ids': torch.Size([8, 65]),
'token_type_ids': torch.Size([8, 65]),
'labels': torch.Size([8])}

That is, "label" has been automatically changed to "labels" by the data_collator.

Probably-unintended metric RegEx in SageMaker notebook

Hi all & thanks for the examples!

I see that in SageMaker notebook 6 some metrics are set up as follows:

metric_definitions=[
    {'Name': 'loss', 'Regex': "'loss': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'learning_rate', 'Regex': "'learning_rate': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_loss', 'Regex': "'eval_loss': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_accuracy', 'Regex': "'eval_accuracy': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_f1', 'Regex': "'eval_f1': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_precision', 'Regex': "'eval_precision': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_recall', 'Regex': "'eval_recall': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_runtime', 'Regex': "'eval_runtime': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_samples_per_second', 'Regex': "'eval_samples_per_second': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'epoch', 'Regex': "'epoch': ([0-9]+(.|e\-)[0-9]+),?"}]

I had an issue today with the way learning_rate was picking up (missing the exponent) and realised that the RegEx's used here probably aren't doing what's intended.

They only match either a decimal point .123 or an exponent e-123, not both (e.g. in a learning rate like 1.03e-6).
...And in fact you have to have exactly one of those (so no integer metrics, e.g. 42).
They only match negative exponents (which granted is the usual use case, but maybe not for every metric?)
They only match positive numbers (which granted is the usual use case, but maybe not for every metric?)
We have unescaped backslashes and haven't indicated it's a raw string via e.g. r"'loss': ...\..."

If careful validation is the aim, maybe it could be something more like the following?

r"'my_metric': (-?[0-9]+(\.[0-9]+)?(e[-+]?[0-9]+)?),?"

...Or perhaps we could be a little more concise and trusting on the numbers, with something like the below?

r"'my_metric': ([-+0-9e.]+)[,}]"

As long as the expression is able to articulate what comes immediately after the number (always comma or close brace as far as I can tell?), we could even be super lazy and e.g. 'my_metric': (.*?)[,}].
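
The difference between the patterns can be checked with Python's re module. This is a sketch under assumptions: the log line is made up to resemble a printed metrics dictionary, the helper names are hypothetical, and Python's regex flavour stands in for whatever engine SageMaker uses. The original pattern is written as a raw string here only to avoid Python's invalid-escape warning.

```python
import re


def original_pattern(name):
    # Pattern from the notebook; note the unescaped "." matches any character.
    return re.compile(r"'%s': ([0-9]+(.|e\-)[0-9]+),?" % name)


def proposed_pattern(name):
    # Stricter pattern suggested above: optional sign, fraction, and exponent.
    return re.compile(r"'%s': (-?[0-9]+(\.[0-9]+)?(e[-+]?[0-9]+)?),?" % name)


def capture(pattern, line):
    match = pattern.search(line)
    return match.group(1) if match else None


# Hypothetical log line in the style of a printed metrics dict.
log_line = "{'loss': 0.4501, 'learning_rate': 1.03e-06, 'epoch': 2}"

print(capture(original_pattern("learning_rate"), log_line))  # 1.03
print(capture(proposed_pattern("learning_rate"), log_line))  # 1.03e-06
print(capture(original_pattern("epoch"), log_line))          # None
print(capture(proposed_pattern("epoch"), log_line))          # 2
```

The original pattern drops the exponent from the learning rate and fails to match the integer epoch at all, while the stricter pattern captures both values in full.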

The notebook language_modeling.ipynb is not accessible

Hello,

I can not open the notebook language_modeling.ipynb.

Instead, the message "An error occurred" is displayed.

cc @sgugger

Pierre

export to ONNX tutorial doesn't work

Hi,

I tried to follow this tutorial:
https://github.com/huggingface/notebooks/blob/master/examples/onnx-export.ipynb

but I get the error below. What should I do to solve this problem?


---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-23-06a1d0b2a7b7> in <module>()
      7 # opt_options.enable_embed_layer_norm = False
      8 
----> 9 optimized_model = optimizer.optimize_model("onnx/bert-base-cased.onnx", model_type='bert', num_heads=12, hidden_size=768)
     10 optimized_model.save_model_to_file('bert.opt.onnx')

8 frames
/usr/local/lib/python3.7/dist-packages/onnxruntime_tools/transformers/optimizer.py in optimize_model(input, model_type, num_heads, hidden_size, optimization_options, opt_level, use_gpu, only_onnxruntime)
    310 
    311     if not only_onnxruntime:
--> 312         optimizer.optimize(optimization_options)
    313 
    314     # Remove the temporary model.

/usr/local/lib/python3.7/dist-packages/onnxruntime_tools/transformers/onnx_model_bert.py in optimize(self, options, add_dynamic_axes)
    277 
    278         if (options is None) or options.enable_skip_layer_norm:
--> 279             self.fuse_skip_layer_norm()
    280 
    281         if (options is None) or options.enable_attention:

/usr/local/lib/python3.7/dist-packages/onnxruntime_tools/transformers/onnx_model_bert.py in fuse_skip_layer_norm(self)
    103 
    104     def fuse_skip_layer_norm(self):
--> 105         fusion = FusionSkipLayerNormalization(self)
    106         fusion.apply()
    107 

/usr/local/lib/python3.7/dist-packages/onnxruntime_tools/transformers/fusion_skiplayernorm.py in __init__(self, model)
     19     def __init__(self, model: OnnxModel):
     20         super().__init__(model, "SkipLayerNormalization", "LayerNormalization")
---> 21         self.shape_infer_helper = self.model.infer_runtime_shape({"batch_size": 4, "seq_len": 7})
     22 
     23     def fuse(self, node, input_name_to_nodes, output_name_to_node):

/usr/local/lib/python3.7/dist-packages/onnxruntime_tools/transformers/onnx_model.py in infer_runtime_shape(self, dynamic_axis_mapping, update)
     34             shape_infer_helper = self.shape_infer_helper
     35 
---> 36         if shape_infer_helper.infer(dynamic_axis_mapping):
     37             return shape_infer_helper
     38         return None

/usr/local/lib/python3.7/dist-packages/onnxruntime_tools/transformers/shape_infer_helper.py in infer(self, dynamic_axis_mapping)
     33         self._preprocess(self.model_)
     34         while self.run_:
---> 35             self.all_shapes_inferred_ = self._infer_impl()
     36 
     37         self.inferred_ = True

/usr/local/lib/python3.7/dist-packages/onnxruntime_tools/transformers/../symbolic_shape_infer.py in _infer_impl(self, start_sympy_data)
   1301                     in_dims = [s[len(s) - out_rank + d] for s in in_shapes if len(s) + d >= out_rank]
   1302                     if len(in_dims) > 1:
-> 1303                         self._check_merged_dims(in_dims, allow_broadcast=True)
   1304 
   1305             for i_o in range(len(node.output)):

/usr/local/lib/python3.7/dist-packages/onnxruntime_tools/transformers/../symbolic_shape_infer.py in _check_merged_dims(self, dims, allow_broadcast)
    527             dims = [d for d in dims if not (is_literal(d) and int(d) <= 1)]
    528         if not all([d == dims[0] for d in dims]):
--> 529             self._add_suggested_merge(dims, apply=True)
    530 
    531     def _compute_matmul_shape(self, node, output_dtype=None):

/usr/local/lib/python3.7/dist-packages/onnxruntime_tools/transformers/../symbolic_shape_infer.py in _add_suggested_merge(self, symbols, apply)
    156 
    157     def _add_suggested_merge(self, symbols, apply=False):
--> 158         assert all([(type(s) == str and s in self.symbolic_dims_) or is_literal(s) for s in symbols])
    159         symbols = set(symbols)
    160         for k, v in self.suggested_merge_.items():

AssertionError:

Additional information in text classification example

Hi, firstly, thank you for this very useful resource!

I was adapting the text classification example for my own data, but was having trouble figuring out how to:

Inspect the predictions made by the model on the test dataset (it seems previous versions of transformers used a dataloader, but I couldn't get it to work here), i.e. run the model on new, arbitrary text data.
Export the model for use in production.

Could you direct me to some resources to learn more about these? (Including them in the notebook might also be useful to newbies like myself.)

Typo on the local GPU of Amazon SageMaker

The following is about the description part, not the code in sample notebooks.
Please fix when updating

local-gpu → local_gpu

mT5 fine-tune for en-my got "NaN" in training loss and validation loss

I tried to fine-tune mT5 for English->Myanmar translation on the Tatoeba-Challenge dataset, following the en-ro translation notebook example. I used "google/mt5-small" as model_checkpoint and tested training for 1 to 4 epochs.
These are the training parameters; I reduced batch_size to 4.

batch_size = 4
args = Seq2SeqTrainingArguments(
    "mt5-translate-en-my",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=True,
)

I got "NaN" for both the training loss and the validation loss.

Can you please help me fix this? Thanks in advance.
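One thing that may be worth checking here is the `fp16=True` setting. T5-family checkpoints are often reported to overflow in half precision, which then shows up as NaN losses. That is an assumption about this run, not a confirmed diagnosis. The sketch below uses only the standard library to show how small the representable range of IEEE half precision is; `fits_in_fp16` is a hypothetical helper written for this illustration.

```python
import struct

# Largest finite value representable in IEEE 754 half precision (fp16).
FP16_MAX = 65504.0


def fits_in_fp16(value: float) -> bool:
    """Return True if `value` can be packed as a finite fp16 number."""
    try:
        # The "e" format code is the 2-byte half-precision float.
        struct.pack("<e", value)
        return True
    except OverflowError:
        # Raised when the magnitude is too large for fp16.
        return False


if __name__ == "__main__":
    for v in (1.0, FP16_MAX, 1e5):
        print(v, fits_in_fp16(v))
```

If an intermediate activation exceeds this range during training, it becomes inf and the loss can turn into NaN. Re-running with `fp16=False` is a cheap way to test whether that is what happens here.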

How to enable 'multi_class' predictor within Sagemaker?

I'm using the zero-shot pipeline with the valhalla/distilbart-mnli-12-9 model. How do I enable multi_class classification? When using transformers with PyTorch in Python, I pass the argument multi_class=True, but I can't find the appropriate way to do this in SageMaker. See code below:

from sagemaker.huggingface.model import HuggingFaceModel

# Hub Model configuration. <https://huggingface.co/models>
model = 'valhalla/distilbart-mnli-12-9'

hub = {
  'HF_MODEL_ID': model, # model_id from hf.co/models
  'HF_TASK':'zero-shot-classification' # NLP task you want to use for predictions,
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    env=hub, # configuration for loading model from Hub
    role=role, # iam role with permissions to create an Endpoint
    transformers_version="4.6", # transformers version used
    pytorch_version="1.7", # pytorch version used
    py_version="py36"
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.p2.xlarge",
    multi_label=True
)
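A possible direction, sketched under assumptions: pipeline arguments are usually sent with each request rather than passed to `deploy()`. The sketch assumes the endpoint accepts a JSON body with an `inputs` field and a `parameters` field that is forwarded to the pipeline call. `build_zero_shot_payload` is a hypothetical helper, and the example text and labels are made up.

```python
import json


def build_zero_shot_payload(text, candidate_labels, multi_class=True):
    """Build a request body that carries pipeline arguments alongside the input.

    Assumption: the endpoint forwards everything under "parameters"
    to the zero-shot-classification pipeline call.
    """
    return {
        "inputs": text,
        "parameters": {
            "candidate_labels": list(candidate_labels),
            "multi_class": multi_class,
        },
    }


if __name__ == "__main__":
    payload = build_zero_shot_payload(
        "The new phone has a great camera but poor battery life.",
        ["camera", "battery", "price"],
    )
    print(json.dumps(payload, indent=2))
    # With a deployed endpoint this would be sent as (not run here):
    # result = predictor.predict(payload)
```

Depending on the transformers version in the container, the argument may be named `multi_label` instead of `multi_class`, so it is worth checking which name the installed pipeline accepts.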

missing an import in 01_getting_started_pytorch/sagemaker-notebook.ipynb

I'm working through the notebook noted in the title, in Sagemaker Studio.

Under the heading "Fine-tuning & starting Sagemaker Training Job", the code block throws an error about not being able to find HuggingFace. The following import resolves the error.

from sagemaker.huggingface import HuggingFace

minor issue, just letting you know.

`undefined symbol` error when running chapter 7 notebooks on a GPU instance

Description

Running the chapter 7 notebooks on a GPU Colab instance throws the following error during training:

RuntimeError: Failed to import transformers.training_args because of the following error (look up to see its traceback):
/usr/local/lib/python3.7/dist-packages/_XLAC.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN2at13_foreach_erf_EN3c108ArrayRefINS_6TensorEEE

Could this be related to the installation of Pytorch from https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl introduced in 2dedfdf?

Note that the notebooks are installing transformers==4.12.5 and torch @ https://download.pytorch.org/whl/cu111/torch-1.10.0%2Bcu111-cp37-cp37m-linux_x86_64.whl

Summarization: model.to(device)?

In the summarization notebook, where/when do we set the device? Are parallel gpus expected? Two things that could help: i) specify where we could set the device and call model.to(device) and ii) explicate where the model might expect data in parallel e.g. how setting batched=True in the pre-processing or how DataCollatorForSeq2Seq expects tensors.

Contributions

I have some notebooks illustrating the use of transformers, may I make a PR?

Longform QA: Alternative datasets not working

Longform QA notebook uses the wiki40b dataset, which is huge to download and work with. I couldn't get other alternative datasets to work with it as the model expects the format of wiki40b.

What should be the correct approach to get any text corpus to work with this notebook?

Can't run the SageMaker PyTorch Getting Started notebook

In the Huggingface Sagemaker-sdk - Getting Started Demo, when I run the load dataset cell I get the error pasted below. I am running the notebook in SageMaker Studio, I have tried both the Data Science and PyTorch 1.6 kernels.

NonMatchingSplitsSizesError Traceback (most recent call last)
in
1 # load dataset
----> 2 dataset = load_dataset(dataset_name)
3
4 # download tokenizer
5 tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

/opt/conda/lib/python3.6/site-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, script_version, use_auth_token, **config_kwargs)
749 try_from_hf_gcs=try_from_hf_gcs,
750 base_path=base_path,
--> 751 use_auth_token=use_auth_token,
752 )
753

/opt/conda/lib/python3.6/site-packages/datasets/builder.py in download_and_prepare(self, download_config, download_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, **download_and_prepare_kwargs)
573 if not downloaded_from_gcs:
574 self._download_and_prepare(
--> 575 dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
576 )
577 # Sync info

/opt/conda/lib/python3.6/site-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verify_infos, **prepare_split_kwargs)
660
661 if verify_infos:
--> 662 verify_splits(self.info.splits, split_dict)
663
664 # Update the info object with the splits.

/opt/conda/lib/python3.6/site-packages/datasets/utils/info_utils.py in verify_splits(expected_splits, recorded_splits)
72 ]
73 if len(bad_splits) > 0:
---> 74 raise NonMatchingSplitsSizesError(str(bad_splits))
75 logger.info("All the splits matched successfully.")
76

NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='test', num_bytes=32660064, num_examples=25000, dataset_name='imdb'), 'recorded': SplitInfo(name='test', num_bytes=9982987, num_examples=7726, dataset_name='imdb')}, {'expected': SplitInfo(name='train', num_bytes=33442202, num_examples=25000, dataset_name='imdb'), 'recorded': SplitInfo(name='train', num_bytes=0, num_examples=0, dataset_name='imdb')}, {'expected': SplitInfo(name='unsupervised', num_bytes=67125548, num_examples=50000, dataset_name='imdb'), 'recorded': SplitInfo(name='unsupervised', num_bytes=0, num_examples=0, dataset_name='imdb')}]

Add readme in subfolders that allow to open notebooks directly in colab notebook

datasets.load_metric() function is not working?

Hi,

I'm trying to follow the tutorial on text classification; however, when I call load_metric(), it throws the following error message:

AttributeError Traceback (most recent call last)
in
1 actual_task = "mnli" if task == "mnli-mm" else task
----> 2 metric = load_metric('glue', actual_task)
3 metric
~/.local/lib/python3.6/site-packages/datasets/load.py in load_metric(path, config_name, process_id, num_process, cache_dir, experiment_id, keep_in_memory, download_config, download_mode, script_version, **metric_init_kwargs)
498 dataset=False,
499 )
--> 500 metric_cls = import_main_class(module_path, dataset=False)
501 metric = metric_cls(
502 config_name=config_name,
~/.local/lib/python3.6/site-packages/datasets/load.py in import_main_class(module_path, dataset)
64 """
65 importlib.invalidate_caches()
---> 66 module = importlib.import_module(module_path)
67
68 if dataset:
/usr/lib/python3.6/importlib/__init__.py in import_module(name, package)
124 break
125 level += 1
--> 126 return _bootstrap._gcd_import(name[level:], package, level)
127
128
/usr/lib/python3.6/importlib/_bootstrap.py in _gcd_import(name, package, level)
/usr/lib/python3.6/importlib/_bootstrap.py in _find_and_load(name, import_)
/usr/lib/python3.6/importlib/_bootstrap.py in _find_and_load_unlocked(name, import_)
/usr/lib/python3.6/importlib/_bootstrap.py in _load_unlocked(spec)
/usr/lib/python3.6/importlib/_bootstrap_external.py in exec_module(self, module)
/usr/lib/python3.6/importlib/_bootstrap.py in _call_with_frames_removed(f, *args, **kwds)
~/.cache/huggingface/modules/datasets_modules/metrics/glue/e4606ab9804a36bcd5a9cebb2cb65bb14b6ac78ee9e6d5981fa679a495dd55de/glue.py in
103
104
--> 105 @datasets.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
106 class Glue(datasets.Metric):
107 def _info(self):
AttributeError: module 'datasets.utils.file_utils' has no attribute 'add_start_docstrings'

I was able to reproduce the fine-tuning process successfully a month ago but got the error above today. The code is exactly the same as in the notebook. Any ideas on what might have gone wrong? Thanks a lot!

Request: colab link

You can add an "Open in Colab" option to every notebook by simply adding a cell with the code:

`<td>
    <a target="_blank" href="https://colab.research.google.com/PUT GITHUB URL HERE">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />
    Run in Google Colab</a>
</td>`

The image is the Colab logo; the link also works without the image:

`<td>
    <a target="_blank" href="https://colab.research.google.com/PUT GITHUB URL HERE">
    Run in Google Colab</a>
</td>`
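For reference, the link target can be generated per notebook instead of being filled in by hand. This is a minimal sketch, assuming Colab's `/github/<owner>/<repo>/blob/<branch>/<path>` URL scheme and a default branch named `master`. `colab_url` and `colab_link_cell` are hypothetical helper names.

```python
def colab_url(owner: str, repo: str, path: str, branch: str = "master") -> str:
    """Return the Colab URL that opens a notebook stored on GitHub."""
    return (
        "https://colab.research.google.com/github/"
        f"{owner}/{repo}/blob/{branch}/{path}"
    )


def colab_link_cell(owner: str, repo: str, path: str, branch: str = "master") -> str:
    """Return the HTML for a text-only 'Run in Google Colab' link."""
    url = colab_url(owner, repo, path, branch)
    return f'<a target="_blank" href="{url}">Run in Google Colab</a>'


if __name__ == "__main__":
    print(colab_link_cell("huggingface", "notebooks", "examples/question_answering.ipynb"))
```

Running this over every `.ipynb` path in the repository would give one link cell per notebook.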

Sagemaker HF '01_getting_started_pytorch' tutorial training not working

I'm trying to run the '01_getting_started_pytorch' tutorial from a SageMaker instance. The original notebook did not work, and neither did a version with code adapted from here (notebook below). Does anyone have any hints on what I'm doing wrong?

tutorial notebook fail

CUDA out of memory Error when run trainer.hyperparameter_search()

Hi there,
I got the following error when running trainer.hyperparameter_search() on Databricks:
RuntimeError: CUDA out of memory. Tried to allocate 300.00 MiB (GPU 0; 11.17 GiB total capacity; 10.18 GiB already allocated; 274.44 MiB free; 10.50 GiB reserved in total by PyTorch)

My dataset is very small (60 sentences in total). I train for 20 epochs, with a training batch size of 8 and an eval batch size of 2.
The error appears after trial 5. Any idea how to solve this problem?

Tiny bug maybe

hello!
Great code and notebook, and I really love ELI5.
I just want to point out that, if I'm not mistaken, the model was trained on 2 * n_results passages without minimum-length filtering, instead of n_results. This is likely not a big deal.
It seems we are using the document generated in https://github.com/huggingface/notebooks/blob/master/longform-qa/lfqa_utils.py#L595 in the training set, specifically support_doc: https://github.com/huggingface/notebooks/blob/master/longform-qa/lfqa_utils.py#L601
However, we never limit the number of results in support_doc to n_results (it should be 2 * n_results from the index.search call), and we never exclude the results that are shorter than the minimum length.
We do this for res_list, but that is not what gets used at training time:
from the notebook, we can see that src_ls gets ignored.

eli5_train_docs = json.load(open('precomputed/eli5_train_precomputed_dense_docs.json'))
eli5_valid_docs = json.load(open('precomputed/eli5_valid_precomputed_dense_docs.json'))
s2s_train_dset = ELI5DatasetS2S(eli5['train_eli5'], document_cache=dict([(k, d) for k, d, src_ls in eli5_train_docs]))
s2s_valid_dset = ELI5DatasetS2S(eli5['validation_eli5'], document_cache=dict([(k, d) for k, d, src_ls in eli5_valid_docs]), training=False)

And later

    question_doc = "question: {} context: {}".format(question, doc)

Anyway, I just wanted to let you know.
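To make the suggestion concrete, here is a minimal sketch of building the support document from the filtered, truncated passages rather than from all retrieved ones. It assumes each passage is a dict with a `passage_text` key and that passages are joined with a `<P>` separator. Both are assumptions about lfqa_utils.py, and `build_support_doc` is a hypothetical helper, not the repository's function.

```python
def build_support_doc(passages, n_results=10, min_length=20):
    """Filter short passages, keep at most n_results, then join them.

    Assumption: `passages` is the list of 2 * n_results retrieved passages,
    each a dict with a "passage_text" string.
    """
    kept = [
        p for p in passages
        if len(p["passage_text"].split()) > min_length
    ][:n_results]
    # Build the document from the same list that is returned as results.
    support_doc = "<P> " + " <P> ".join(p["passage_text"] for p in kept)
    return support_doc, kept


if __name__ == "__main__":
    retrieved = [
        {"passage_text": "too short"},
        {"passage_text": "this passage is long enough to keep"},
        {"passage_text": "another passage that is also long enough"},
        {"passage_text": "a third long passage that would exceed n_results"},
    ]
    doc, kept = build_support_doc(retrieved, n_results=2, min_length=3)
    print(doc)
```

With this ordering, the text the model is trained on and the returned result list always describe the same passages.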

Loading Bert-large model results in OOM

On two p3dn.24xlarge instances, I hit an OOM error while trying to load the pre-trained bert-large-uncased-whole-word-masking model.

Script

import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained('bert-large-uncased-whole-word-masking')

SM Launcher

import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()
distribution={'mpi': {'enabled':True,"custom_mpi_options":"-verbose --NCCL_DEBUG=INFO -x RDMAV_FORK_SAFE=1"}}

# instance configurations
instance_type='ml.p3dn.24xlarge'
instance_count=2

huggingface_estimator = HuggingFace(
    entry_point='model_load_hf_bert.py',
    source_dir='.',
    instance_type=instance_type,
    role=role,
    instance_count=instance_count,
    transformers_version='4.5.0',
    tensorflow_version='2.4.1',
    py_version='py37',
    distribution=distribution,
    debugger_hook_config=False, # currently needed
)
huggingface_estimator.fit()

Nodes	Instance Type	Result
1	p3.2xlarge	success
2	p3.2xlarge	success
1	p3dn.24xlarge	OOM
2	p3dn.24xlarge	OOM

This issue is observed regardless of the number of nodes.

Specifically the line where it fails

notebooks/sagemaker/07_tensorflow_distributed_training_data_parallelism/scripts/train.py

Line 125 in 7a3bbdd

model = TFAutoModelForSequenceClassification.from_pretrained(args.model_name)

For a detailed Stack Trace

[1,14]<stderr>:2021-05-04 22:39:11.526981: F ./tensorflow/core/kernels/random_op_gpu.h:232] Non-OK-status: GpuLaunchKernel(FillPhiloxRandomKernelLaunch<Distribution>, num_blocks, block_size, 0, d.stream(), gen, data, size, dist) status: Internal: out of memory
[1,15]<stderr>:2021-05-04 22:39:21.984104: W tensorflow/core/common_runtime/bfc_allocator.cc:431] Allocator (GPU_0_bfc) ran out of memory trying to allocate 119.23MiB (rounded to 125018112)requested by op TruncatedNormal

[1,15]<stderr>:Current allocation summary follows.
[1,15]<stderr>:2021-05-04 22:39:21.984204: W tensorflow/core/common_runtime/bfc_allocator.cc:439] *___________________________________________________________________________________________________
[1,15]<stderr>:2021-05-04 22:39:21.985568: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at random_op.cc:77 : Resource exhausted: OOM when allocating tensor with shape[30522,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[1,15]<stderr>:Traceback (most recent call last):
[1,15]<stderr>:  File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[1,15]<stderr>:    "__main__", mod_spec)
[1,15]<stderr>:  File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
[1,15]<stderr>:    exec(code, run_globals)
[1,15]<stderr>:  File "/usr/local/lib/python3.7/site-packages/mpi4py/__main__.py", line 7, in <module>
[1,15]<stderr>:    main()
[1,15]<stderr>:  File "/usr/local/lib/python3.7/site-packages/mpi4py/run.py", line 196, in main
[1,15]<stderr>:    run_command_line(args)
[1,15]<stderr>:  File "/usr/local/lib/python3.7/site-packages/mpi4py/run.py", line 47, in run_command_line
[1,15]<stderr>:    run_path(sys.argv[0], run_name='__main__')
[1,15]<stderr>:  File "/usr/local/lib/python3.7/runpy.py", line 263, in run_path
[1,15]<stderr>:    pkg_name=pkg_name, script_name=fname)
[1,15]<stderr>:  File "/usr/local/lib/python3.7/runpy.py", line 96, in _run_module_code
[1,15]<stderr>:    mod_name, mod_spec, pkg_name, script_name)
[1,15]<stderr>:  File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
[1,15]<stderr>:    exec(code, run_globals)
[1,15]<stderr>:  File "hf_bert_public.py", line 125, in <module>
[1,15]<stderr>:    model = TFAutoModelForSequenceClassification.from_pretrained(args.model_name)
[1,15]<stderr>:  File "/usr/local/lib/python3.7/site-packages/transformers/models/auto/auto_factory.py", line 360, in from_pretrained
[1,15]<stderr>:    pretrained_model_name_or_path, *model_args, config=config, **kwargs
[1,15]<stderr>:  File "/usr/local/lib/python3.7/site-packages/transformers/modeling_tf_utils.py", line 1271, in from_pretrained
[1,15]<stderr>:    model(model.dummy_inputs)  # build the network with dummy inputs
[1,15]<stderr>:  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 999, in __call__
[1,15]<stderr>:    outputs = call_fn(inputs, *args, **kwargs)
[1,15]<stderr>:  File "/usr/local/lib/python3.7/site-packages/transformers/models/bert/modeling_tf_bert.py", line 1450, in call
[1,15]<stderr>:    training=inputs["training"],
[1,15]<stderr>:  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 999, in __call__
[1,15]<stderr>:    outputs = call_fn(inputs, *args, **kwargs)
[1,15]<stderr>:  File "/usr/local/lib/python3.7/site-packages/transformers/models/bert/modeling_tf_bert.py", line 650, in call
[1,15]<stderr>:    training=inputs["training"],
[1,15]<stderr>:  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 996, in __call__
[1,15]<stderr>:    self._maybe_build(inputs)
[1,15]<stderr>:  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 2688, in _maybe_build
[1,15]<stderr>:    self.build(input_shapes)  # pylint:disable=not-callable
[1,15]<stderr>:  File "/usr/local/lib/python3.7/site-packages/transformers/models/bert/modeling_tf_bert.py", line 152, in build
[1,15]<stderr>:    initializer=get_initializer(self.initializer_range),
[1,15]<stderr>:  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 616, in add_weight
[1,15]<stderr>:    caching_device=caching_device)
[1,15]<stderr>:  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/training/tracking/base.py", line 750, in _add_variable_with_custom_getter
[1,15]<stderr>:    **kwargs_for_getter)
[1,15]<stderr>:  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer_utils.py", line 145, in make_variable
[1,15]<stderr>:    shape=variable_shape if variable_shape else None)
[1,15]<stderr>:  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 260, in __call__
[1,15]<stderr>:    return cls._variable_v1_call(*args, **kwargs)
[1,15]<stderr>:  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 221, in _variable_v1_call
[1,15]<stderr>:    shape=shape)
[1,15]<stderr>:  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 199, in <lambda>
[1,15]<stderr>:    previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
[1,15]<stderr>:  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 2597, in default_variable_creator
[1,15]<stderr>:    shape=shape)
[1,15]<stderr>:  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 264, in __call__
[1,15]<stderr>:    return super(VariableMetaclass, cls).__call__(*args, **kwargs)
[1,15]<stderr>:  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 1518, in __init__
[1,15]<stderr>:    distribute_strategy=distribute_strategy)
[1,15]<stderr>:  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 1651, in _init_from_args
[1,15]<stderr>:    initial_value() if init_from_fn else initial_value,
[1,15]<stderr>:  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/initializers/initializers_v2.py", line 342, in __call__
[1,15]<stderr>:    return super(TruncatedNormal, self).__call__(shape, dtype=_get_dtype(dtype))
[1,15]<stderr>:  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/ops/init_ops_v2.py", line 450, in __call__
[1,15]<stderr>:    self.stddev, dtype)
[1,15]<stderr>:  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/ops/init_ops_v2.py", line 1053, in truncated_normal
[1,15]<stderr>:    shape=shape, mean=mean, stddev=stddev, dtype=dtype, seed=self.seed)
[1,15]<stderr>:  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
[1,15]<stderr>:    return target(*args, **kwargs)
[1,15]<stderr>:  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/ops/random_ops.py", line 196, in truncated_normal
[1,15]<stderr>:    shape_tensor, dtype, seed=seed1, seed2=seed2)
[1,15]<stderr>:  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/ops/gen_random_ops.py", line 902, in truncated_normal
[1,15]<stderr>:    _ops.raise_from_not_ok_status(e, name)
[1,15]<stderr>:  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 6843, in raise_from_not_ok_status
[1,15]<stderr>:    six.raise_from(core._status_to_exception(e.code, message), None)
[1,15]<stderr>:  File "<string>", line 3, in raise_from
[1,15]<stderr>:tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[30522,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:TruncatedNormal]
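One hypothesis that fits the results table and the trace: the single-GPU p3.2xlarge runs succeed, while on the multi-GPU instance the rank tags ([1,14], [1,15]) show many MPI processes, and the failing allocation is on GPU:0. If every process on a node builds the model on GPU 0, that device would run out of memory. The sketch below pins each process to the GPU matching its local rank before TensorFlow is imported. It assumes Open MPI's `OMPI_COMM_WORLD_LOCAL_RANK` variable is set by the launcher, and `pin_gpu_to_local_rank` is a hypothetical helper.

```python
import os


def pin_gpu_to_local_rank(environ=None):
    """Expose only the GPU whose index equals this process's local MPI rank.

    Must run before TensorFlow is imported, because CUDA_VISIBLE_DEVICES
    is read when the CUDA runtime initializes.
    Returns the local rank, or None when not launched under Open MPI.
    """
    if environ is None:
        environ = os.environ
    local_rank = environ.get("OMPI_COMM_WORLD_LOCAL_RANK")
    if local_rank is None:
        return None
    environ["CUDA_VISIBLE_DEVICES"] = str(int(local_rank))
    return int(local_rank)


if __name__ == "__main__":
    # Demonstrated on a plain dict so the real environment is untouched.
    fake_env = {"OMPI_COMM_WORLD_LOCAL_RANK": "3"}
    print(pin_gpu_to_local_rank(fake_env), fake_env["CUDA_VISIBLE_DEVICES"])
```

In the training script this would be called at the very top, ahead of `import tensorflow as tf`. Whether it resolves the OOM depends on the hypothesis above being correct.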

Jupyter Notebook: IProgress Issue with the Progress Bar

When executing the transformers pipeline in a Jupyter Notebook, I have had issues a couple of times with the progress bar that shows the download of an NLP model. The error thrown was an ImportError pointing to problems with IProgress and the Jupyter widgets and extensions.

Question Answering

Is there a way to make the code more general? https://github.com/huggingface/notebooks/blob/master/examples/question_answering.ipynb
Is it necessary to have the questions and answers already encoded beforehand?
Can the questions be supplied later instead of being put in beforehand?

bug in question answering notebook

code in function postprocess_qa_predictions

  if min_null_score is None or min_null_score < feature_null_score:
      min_null_score = feature_null_score

I think < should be replaced by >
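A small standalone check of the comparison, using made-up null scores. The loop shape mirrors the snippet above; the function names are hypothetical. With `<` the variable ends up holding the largest score, and with `>` it holds the smallest, which is what the name `min_null_score` implies.

```python
def min_null_score_as_written(feature_null_scores):
    """Replicates the comparison from the notebook snippet (uses `<`)."""
    min_null_score = None
    for feature_null_score in feature_null_scores:
        if min_null_score is None or min_null_score < feature_null_score:
            min_null_score = feature_null_score
    return min_null_score


def min_null_score_proposed(feature_null_scores):
    """Same loop with the proposed `>` comparison."""
    min_null_score = None
    for feature_null_score in feature_null_scores:
        if min_null_score is None or min_null_score > feature_null_score:
            min_null_score = feature_null_score
    return min_null_score


if __name__ == "__main__":
    scores = [2.5, -1.0, 0.7]  # made-up null scores for three features
    print(min_null_score_as_written(scores))  # 2.5, the maximum
    print(min_null_score_proposed(scores))    # -1.0, the minimum
```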

General case question_answering.ipynb

For the notebook question_answering.ipynb:
As you mention in a commented section, for a more general case we will need to match sample_id to an example index.
For the example you provided, did you know which part of the dictionary the answer was in?
I am working with a txt document and want a general way to find the answer to a question within that document.

What is the 'token classification head'?

Training script error in 02_getting_started_tensorflow/sagemaker-notebook.ipynb

When the train.py script is used as is, the model doesn't train successfully on a SageMaker notebook instance, when using the built in conda_tensorflow2_p36 conda environment. This seems to be due to the dataset not being shuffled. The model always outputs LABEL_0, and achieves a test accuracy of 50%.

Adding:

train_dataset = train_dataset.shuffle()

at line 46 seems to solve this issue. Upon retraining the model functions as expected, and achieves a test accuracy of 89.53%.
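Why shuffling could matter here, as a sketch: if the training examples happen to be stored ordered by label (an assumption about the dataset, not something verified in this issue), every batch contains a single class and the model can collapse to predicting one label. The toy data and the `make_batches` helper below are hypothetical.

```python
import random


def make_batches(labels, batch_size):
    """Split a label sequence into consecutive batches."""
    return [labels[i:i + batch_size] for i in range(0, len(labels), batch_size)]


if __name__ == "__main__":
    # Toy label column stored sorted by class: all 0s, then all 1s.
    labels = [0] * 8 + [1] * 8

    ordered_batches = make_batches(labels, batch_size=4)
    print("ordered:", ordered_batches)

    # Shuffle a copy with a fixed seed so the run is repeatable.
    shuffled = labels[:]
    random.Random(0).shuffle(shuffled)
    print("shuffled:", make_batches(shuffled, batch_size=4))
```

In the ordered case each batch is pure class 0 or pure class 1, so the gradient signal swings between the classes instead of contrasting them within a step.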

Failed. Reason: AlgorithmError: ExecuteUserScriptError:

Hi All,

I am trying to replicate the attached code and am still getting the above error. Can you suggest a solution?

Code :- https://github.com/huggingface/notebooks/blob/master/sagemaker/14_train_and_push_to_hub/sagemaker-notebook.ipynb

Error:- Error for Training job huggingface-pytorch-training-2022-01-25-19-23-38-888: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "raise EnvironmentError(msg)

LOG:-

2022-01-25 19:26:23 Training - Downloading the training image....................bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-01-25 19:29:48,286 sagemaker-training-toolkit INFO Imported framework sagemaker_pytorch_container.training
2022-01-25 19:29:48,307 sagemaker_pytorch_container.training INFO Block until all host DNS lookups succeed.
2022-01-25 19:29:51,328 sagemaker_pytorch_container.training INFO Invoking user training script.
2022-01-25 19:29:51,774 sagemaker-training-toolkit INFO Invoking user script
Training Env:
{
"additional_framework_parameters": {},
"channel_input_dirs": {
"test": "/opt/ml/input/data/test",
"train": "/opt/ml/input/data/train"
},
"current_host": "algo-1",
"framework_module": "sagemaker_pytorch_container.training:main",
"hosts": [
"algo-1"
],
"hyperparameters": {
"hub_token": null,
"model_id": "distilbert-base-uncased",
"eval_batch_size": 20,
"train_batch_size": 10,
"push_to_hub": true,
"hub_model_id": "sagemaker-distilbert-emotion",
"epochs": 1,
"learning_rate": 3e-05,
"hub_strategy": "every_save",
"fp16": true
},
"input_config_dir": "/opt/ml/input/config",
"input_data_config": {
"test": {
"TrainingInputMode": "File",
"S3DistributionType": "FullyReplicated",
"RecordWrapperType": "None"
},
"train": {
"TrainingInputMode": "File",
"S3DistributionType": "FullyReplicated",
"RecordWrapperType": "None"
}
},
"input_dir": "/opt/ml/input",
"is_master": true,
"job_name": "huggingface-pytorch-training-2022-01-25-19-23-38-888",
"log_level": 20,
"master_hostname": "algo-1",
"model_dir": "/opt/ml/model",
"module_dir": "s3://sagemaker-eu-west-2-352316401451/huggingface-pytorch-training-2022-01-25-19-23-38-888/source/sourcedir.tar.gz",
"module_name": "train",
"network_interface_name": "eth0",
"num_cpus": 8,
"num_gpus": 1,
"output_data_dir": "/opt/ml/output/data",
"output_dir": "/opt/ml/output",
"output_intermediate_dir": "/opt/ml/output/intermediate",
"resource_config": {
"current_host": "algo-1",
"hosts": [
"algo-1"
],
"network_interface_name": "eth0"
},
"user_entry_point": "train.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"epochs":1,"eval_batch_size":20,"fp16":true,"hub_model_id":"sagemaker-distilbert-emotion","hub_strategy":"every_save","hub_token":null,"learning_rate":3e-05,"model_id":"distilbert-base-uncased","push_to_hub":true,"train_batch_size":10}
SM_USER_ENTRY_POINT=train.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"test":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["test","train"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=train
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=8
SM_NUM_GPUS=1
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-eu-west-2-352316401451/huggingface-pytorch-training-2022-01-25-19-23-38-888/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"test":"/opt/ml/input/data/test","train":"/opt/ml/input/data/train"},"current_host":"algo-1","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"epochs":1,"eval_batch_size":20,"fp16":true,"hub_model_id":"sagemaker-distilbert-emotion","hub_strategy":"every_save","hub_token":null,"learning_rate":3e-05,"model_id":"distilbert-base-uncased","push_to_hub":true,"train_batch_size":10},"input_config_dir":"/opt/ml/input/config","input_data_config":{"test":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"huggingface-pytorch-training-2022-01-25-19-23-38-888","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-eu-west-2-352316401451/huggingface-pytorch-training-2022-01-25-19-23-38-888/source/sourcedir.tar.gz","module_name":"train","network_interface_name":"eth0","num_cpus":8,"num_gpus":1,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"},"user_entry_point":"train.py"}
SM_USER_ARGS=["--epochs","1","--eval_batch_size","20","--fp16","True","--hub_model_id","sagemaker-distilbert-emotion","--hub_strategy","every_save","--hub_token","","--learning_rate","3e-05","--model_id","distilbert-base-uncased","--push_to_hub","True","--train_batch_size","10"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TEST=/opt/ml/input/data/test
SM_CHANNEL_TRAIN=/opt/ml/input/data/train
SM_HP_HUB_TOKEN=
SM_HP_MODEL_ID=distilbert-base-uncased
SM_HP_EVAL_BATCH_SIZE=20
SM_HP_TRAIN_BATCH_SIZE=10
SM_HP_PUSH_TO_HUB=true
SM_HP_HUB_MODEL_ID=sagemaker-distilbert-emotion
SM_HP_EPOCHS=1
SM_HP_LEARNING_RATE=3e-05
SM_HP_HUB_STRATEGY=every_save
SM_HP_FP16=true
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python38.zip:/opt/conda/lib/python3.8:/opt/conda/lib/python3.8/lib-dynload:/opt/conda/lib/python3.8/site-packages
Invoking script with the following command:
/opt/conda/bin/python3.8 train.py --epochs 1 --eval_batch_size 20 --fp16 True --hub_model_id sagemaker-distilbert-emotion --hub_strategy every_save --hub_token --learning_rate 3e-05 --model_id distilbert-base-uncased --push_to_hub True --train_batch_size 10

2022-01-25 19:30:06 Uploading - Uploading generated training model
2022-01-25 19:30:06 Failed - Training job failed
ProfilerReport-1643138618: Stopping
2022-01-25 19:29:56,293 - main - INFO - loaded train_dataset length is: 16000
2022-01-25 19:29:56,293 - main - INFO - loaded test_dataset length is: 2000
404 Client Error: Not Found for url: https://huggingface.co/None/resolve/main/config.json
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/transformers/configuration_utils.py", line 550, in get_config_dict
404 Client Error: Not Found for url: https://huggingface.co/None/resolve/main/config.json
resolved_config_file = cached_path(
File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 1491, in cached_path
output_path = get_from_cache(
File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 1663, in get_from_cache
r.raise_for_status()
File "/opt/conda/lib/python3.8/site-packages/requests/models.py", line 953, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/None/resolve/main/config.json
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train.py", line 57, in <module>
model = AutoModelForSequenceClassification.from_pretrained(args.model_name)
File "/opt/conda/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 396, in from_pretrained
config, kwargs = AutoConfig.from_pretrained(
File "/opt/conda/lib/python3.8/site-packages/transformers/models/auto/configuration_auto.py", line 558, in from_pretrained
config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/transformers/configuration_utils.py", line 575, in get_config_dict
raise EnvironmentError(msg)
OSError: Can't load config for 'None'. Make sure that:

'None' is a correct model identifier listed on 'https://huggingface.co/models'
(make sure 'None' is not a path to a local directory with something else, in that case)
or 'None' is the correct path to a directory containing a config.json file
2022-01-25 19:29:57,197 sagemaker-training-toolkit ERROR Reporting training FAILURE
2022-01-25 19:29:57,197 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
ExitCode 1
ErrorMessage "raise EnvironmentError(msg)
OSError: Can't load config for 'None'. Make sure that: - 'None' is a correct model identifier listed on 'https://huggingface.co/models' (make sure 'None' is not a path to a local directory with something else, in that case) - or 'None' is the correct path to a directory containing a config.json file"
Command "/opt/conda/bin/python3.8 train.py --epochs 1 --eval_batch_size 20 --fp16 True --hub_model_id sagemaker-distilbert-emotion --hub_strategy every_save --hub_token --learning_rate 3e-05 --model_id distilbert-base-uncased --push_to_hub True --train_batch_size 10"
2022-01-25 19:29:57,197 sagemaker-training-toolkit ERROR Encountered exit_code 1

UnexpectedStatusException Traceback (most recent call last)
in
----> 1 huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
690 self.jobs.append(self.latest_training_job)
691 if wait:
--> 692 self.latest_training_job.wait(logs=logs)
693
694 def _compilation_job_name(self):

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
1665 # If logs are requested, call logs_for_jobs.
1666 if logs != "None":
-> 1667 self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
1668 else:
1669 self.sagemaker_session.wait_for_job(self.job_name)

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
3783
3784 if wait:
-> 3785 self._check_job_status(job_name, description, "TrainingJobStatus")
3786 if dot:
3787 print()

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
3341 ),
3342 allowed_statuses=["Completed", "Stopped"],
-> 3343 actual_status=status,
3344 )
3345

UnexpectedStatusException: Error for Training job huggingface-pytorch-training-2022-01-25-19-23-38-888: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "raise EnvironmentError(msg)
OSError: Can't load config for 'None'. Make sure that: - 'None' is a correct model identifier listed on 'https://huggingface.co/models' (make sure 'None' is not a path to a local directory with something else, in that case) - or 'None' is the correct path to a directory containing a config.json file"
Command "/opt/conda/bin/python3.8 train.py --epochs 1 --eval_batch_size 20 --fp16 True --hub_model_id sagemaker-distilbert-emotion --hub_strategy every_save --hub_token --learning_rate 3e-05 --model_id distilbert-base-uncased --push_to_hub True --train_batch_size 10"
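
One possible reading of the trace above: the job is launched with --model_id distilbert-base-uncased, while line 57 of train.py reads args.model_name, which would then keep its default of None. A minimal sketch of an argument parser that accepts both spellings, assuming train.py uses argparse. The parse_args helper and the flag alias are illustrative, not the notebook's actual code.

```python
import argparse


def parse_args(argv=None):
    """Parse training arguments, accepting --model_id or --model_name."""
    parser = argparse.ArgumentParser(allow_abbrev=False)
    # Both flags write to the same destination, so either spelling works.
    parser.add_argument(
        "--model_id", "--model_name", dest="model_id", type=str, default=None
    )
    parser.add_argument("--epochs", type=int, default=1)
    # parse_known_args skips hyperparameters this sketch does not declare.
    args, _unknown = parser.parse_known_args(argv)
    if args.model_id is None:
        # Fail early instead of passing None to from_pretrained.
        raise ValueError("no model given: pass --model_id or --model_name")
    return args


args = parse_args(["--epochs", "1", "--model_id", "distilbert-base-uncased"])
print(args.model_id)
```

Failing early with a clear message avoids the misleading 404 for https://huggingface.co/None/resolve/main/config.json.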

Colab doesn't run the TPU example on Ch3

I followed the guide from the course and got the following error:
"ProcessExitedException: process 0 terminated with signal SIGSEGV"

Notebook to reproduce:
https://github.com/JonathanSum/Hugging-Face-Course/blob/main/Ch3_A_full_training_TPU.ipynb

Issue: Adding new tokens to bert tokenizer in QA

WARNING: This issue is a duplicate of another issue I opened; I apologize if I opened it in the wrong place.

Hello Hugging Face team (@sgugger, @joeddav, @LysandreJik),
I have a problem with this code base:
notebooks/examples/question_answering.ipynb
ENV: Google Colab - transformers version: 4.5.0; datasets version: 1.5.0; torch version: 1.8.1+cu101
I am trying to add some domain tokens to the bert-base-cased tokenizer:

model_checkpoint = 'bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
list_of_domain_tokens = ["token1", "token2", "token3"]
tokenizer.add_tokens(list_of_domain_tokens)
...
...
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
print(model.device)  # cpu
model.resize_token_embeddings(len(tokenizer))
trainer = Trainer(...)

Then, during the trainer.train() call, it reports the attached error.
Can you please tell me what I'm doing wrong?
The tokenizer output is the usual BERT input in the form of List[List[int]], e.g. input_ids and attention_mask,
so I can't figure out where the device problem comes from.

Kind Regards,
Andrea

Course Chapter 7, Section 4 Translation "TF Model does not exist".

The TensorFlow model "Helsinki-NLP/opus-mt-en-fr" does not exist and therefore produces an error while executing the notebook.

Steps to Reproduce

Visit Course Chapter 7, Section 4 for TensorFlow.
Click on "Open in Colab" at the top right.
Run all cells.
Observe the error when executing this cell:

from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Observed Error

404 Client Error: Not Found for url: https://huggingface.co/Helsinki-NLP/opus-mt-en-fr/resolve/main/tf_model.h5
---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/transformers/modeling_tf_utils.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
   1555                     use_auth_token=use_auth_token,
-> 1556                     user_agent=user_agent,
   1557                 )

5 frames
HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/Helsinki-NLP/opus-mt-en-fr/resolve/main/tf_model.h5

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/transformers/modeling_tf_utils.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
   1564                     f"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing a file named one of {TF2_WEIGHTS_NAME}, {WEIGHTS_NAME}.\n\n"
   1565                 )
-> 1566                 raise EnvironmentError(msg)
   1567             if resolved_archive_file == archive_file:
   1568                 logger.info(f"loading weights file {archive_file}")

OSError: Can't load weights for 'Helsinki-NLP/opus-mt-en-fr'. Make sure that:

- 'Helsinki-NLP/opus-mt-en-fr' is a correct model identifier listed on 'https://huggingface.co/models'
  (make sure 'Helsinki-NLP/opus-mt-en-fr' is not a path to a local directory with something else, in that case)

- or 'Helsinki-NLP/opus-mt-en-fr' is the correct path to a directory containing a file named one of tf_model.h5, pytorch_model.bin.

Reason for Error

The TensorFlow weights file (tf_model.h5) is missing from the Hugging Face model hub.
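
The error message names two weight files it looks for, tf_model.h5 and pytorch_model.bin. When a repository holds only PyTorch weights, the TensorFlow from_pretrained method accepts from_pt=True to convert them on load (this needs PyTorch installed). A small sketch of that decision follows. The tf_load_kwargs helper and the example file listing are hypothetical, and the real call is left as a comment because it needs network access.

```python
TF_WEIGHTS = "tf_model.h5"
PT_WEIGHTS = "pytorch_model.bin"


def tf_load_kwargs(repo_files):
    """Return extra from_pretrained kwargs for a TF model, given the repo's file names."""
    files = set(repo_files)
    if TF_WEIGHTS in files:
        # Native TensorFlow weights exist; no conversion needed.
        return {}
    if PT_WEIGHTS in files:
        # Only PyTorch weights exist; ask transformers to convert them.
        return {"from_pt": True}
    raise FileNotFoundError(f"neither {TF_WEIGHTS} nor {PT_WEIGHTS} found")


# Hypothetical listing for a repository without TensorFlow weights.
kwargs = tf_load_kwargs(["config.json", "pytorch_model.bin"])
print(kwargs)

# With the notebook's names, the load would then look like:
# from transformers import TFAutoModelForSeq2SeqLM
# model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint, **kwargs)
```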

Error using 'sacrebleu'

I got:

AttributeError: module 'sacrebleu' has no attribute 'DEFAULT_TOKENIZER'

when I tried to run metric = load_metric("sacrebleu") in translation.ipynb.

I think the sacrebleu version should be pinned in the notebook.
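
One way to surface this mismatch before load_metric runs is to check for the attribute up front. A minimal sketch, assuming the metric script relies on sacrebleu.DEFAULT_TOKENIZER as the error suggests. The module_has_attr helper is hypothetical.

```python
import importlib


def module_has_attr(module_name, attr):
    """Return True if the module imports and exposes the given attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Module is not installed at all.
        return False
    return hasattr(module, attr)


if not module_has_attr("sacrebleu", "DEFAULT_TOKENIZER"):
    print("Installed sacrebleu lacks DEFAULT_TOKENIZER; an older version may be needed.")
```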

How can I use model.save() ?

Hello, I am using the language_modeling-tf.ipynb notebook to train a masked language model. After training finishes, I try to save the model with model.save("/content/drive/MyDrive/bert_pre"), but I get this error:

ValueError Traceback (most recent call last)
in ()
----> 1 model.save("/content/drive/MyDrive/bert_pre")

1 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/tracking/data_structures.py in _checkpoint_dependencies(self)
875 "dictionary checkpointed, wrap it in a "
876 "non-trackable object; it will be subsequently ignored." % (
--> 877 self, self, self._self_last_wrapped_dict_snapshot))
878 assert not self._dirty # Any reason for dirtiness should have an exception.
879 return super(_DictWrapper, self)._checkpoint_dependencies

ValueError: Unable to save the object {'loss': <function dummy_loss at 0x7f9e28a0a830>, 'logits': None} (a dictionary wrapper constructed automatically on attribute assignment). The wrapped dictionary was modified outside the wrapper (its final value was {'loss': <function dummy_loss at 0x7f9e28a0a830>, 'logits': None}, its value when a checkpoint dependency was added was None), which breaks restoration on object creation.

If you don't need this dictionary checkpointed, wrap it in a non-trackable object; it will be subsequently ignored.

How can I save my model?
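
Models from the transformers library expose save_pretrained, which writes the weights and config without going through Keras object tracking, so it may sidestep the error above. A minimal sketch of that dispatch follows. The save_model helper and the DummyModel stand-in are hypothetical and exist only so the example runs without TensorFlow.

```python
import os
import tempfile


def save_model(model, path):
    """Save with save_pretrained when available, else fall back to a Keras-style save."""
    if hasattr(model, "save_pretrained"):
        model.save_pretrained(path)
        return "save_pretrained"
    model.save(path)
    return "save"


class DummyModel:
    """Stand-in for a transformers model; it only writes an empty config file."""

    def save_pretrained(self, path):
        os.makedirs(path, exist_ok=True)
        with open(os.path.join(path, "config.json"), "w") as f:
            f.write("{}")


with tempfile.TemporaryDirectory() as tmp:
    used = save_model(DummyModel(), os.path.join(tmp, "bert_pre"))
    print(used)
```

With the real model, the call would be model.save_pretrained("/content/drive/MyDrive/bert_pre"), and the result can be reloaded by passing the same path to from_pretrained.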
