ALBERT-TF2.0's Issues

Multi GPU finetuning

I have a question regarding your experiment fine-tuning for SQuAD 2.0 with 4x Titan RTX 24 GB. How long was the total training time? I'm running the same experiment with 8x Tesla V100 16 GB, which according to my calculations will take about 200 hrs. I was expecting a much lower training time with 8 GPUs.

python albert-tf2/run_squad.py \
--mode=train_and_predict \
--input_meta_data_path=${OUTPUT_DIR}/squad_${SQUAD_VERSION}meta_data \
--train_data_path=${OUTPUT_DIR}/squad${SQUAD_VERSION}_train.tf_record \
--predict_file=${SQUAD_DIR}/dev-${SQUAD_VERSION}.json \
--albert_config_file=${ALBERT_DIR}/config.json \
--init_checkpoint=${ALBERT_DIR}/tf2_model.h5 \
--spm_model_file=${ALBERT_DIR}/30k-clean.model \
--train_batch_size=32 \
--predict_batch_size=32 \
--learning_rate=1.5e-5 \
--num_train_epochs=3 \
--model_dir=${OUTPUT_DIR} \
--strategy_type=mirror \
--version_2_with_negative \
--max_seq_length=384

Thanks in advance!
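One quick thing worth ruling out on the 8-GPU machine is whether TensorFlow actually sees all eight GPUs and whether MirroredStrategy replicates across them. A minimal check, independent of this repo's code, might look like:

    import tensorflow as tf

    # All eight V100s should show up here; if not, the run is effectively single-GPU.
    print("Visible GPUs:", tf.config.experimental.list_physical_devices("GPU"))

    # MirroredStrategy should report one replica per visible GPU.
    strategy = tf.distribute.MirroredStrategy()
    print("Replicas in sync:", strategy.num_replicas_in_sync)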

Pre-training on GPUs seems to be stuck

I have tried to perform pre-training from scratch on GPUs using the following command:
python run_pretraining.py --albert_config_file=albert_config.json --do_train --input_files=/somewhere/*/tf_examples.*.tfrecord --meta_data_file_path=/somewhere/train_meta_data --output_dir=/somewhere --strategy_type=mirror --train_batch_size=128 --num_train_epochs=2

But it seems to be stuck as follows:

...
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1209 00:48:14.076103 139679391237952 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:batch_all_reduce: 32 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
I1209 00:48:24.566839 139679391237952 cross_device_ops.py:748] batch_all_reduce: 32 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:batch_all_reduce: 32 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
I1209 00:48:45.377745 139679391237952 cross_device_ops.py:748] batch_all_reduce: 32 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
2019-12-09 00:49:16.104345: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0

The GPUs are running, but no output is produced.

The core of the pre-training code is similar to the TensorFlow BERT code below, and I have successfully run that BERT pre-training code.
https://github.com/tensorflow/models/tree/master/official/nlp/bert

My environment is as follows:

  • tensorflow-gpu==2.0.0
  • CUDA 10.0

Thanks in advance.

run_classifier.py Error COLA

Running the CoLA script returns:

2020-01-15 17:53:21.504699: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] no NVIDIA GPU device is present: /dev/nvidia0 does not exist
2020-01-15 17:53:21.505194: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-01-15 17:53:21.518577: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3599910000 Hz
2020-01-15 17:53:21.519665: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3c2f130 executing computations on platform Host. Devices:
2020-01-15 17:53:21.519701: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
Traceback (most recent call last):
  File "run_classifer.py", line 457, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "run_classifer.py", line 307, in main
    loss_multiplier=loss_multiplier)
  File "run_classifer.py", line 195, in get_model
    pooled_output, _ = albert_layer(input_word_ids, input_mask, input_type_ids)
  File "/root/ALBERT-TF2.0/albert.py", line 212, in __call__
    return super(AlbertModel, self).__call__(inputs, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py", line 842, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 237, in wrapper
    raise e.ag_error_metadata.to_exception(e)
RuntimeError: in converted code:

    /root/ALBERT-TF2.0/albert.py:229 call  *
        word_embeddings = self.embedding_lookup(input_word_ids)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py:817 __call__
        self._maybe_build(inputs)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py:2141 _maybe_build
        self.build(input_shapes)
    /root/ALBERT-TF2.0/albert.py:273 build
        dtype=self.dtype)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py:522 add_weight
        aggregation=aggregation)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/tracking/base.py:744 _add_variable_with_custom_getter
        **kwargs_for_getter)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer_utils.py:139 make_variable
        shape=variable_shape if variable_shape else None)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py:258 __call__
        return cls._variable_v1_call(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py:219 _variable_v1_call
        shape=shape)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py:65 getter
        return captured_getter(captured_previous, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/distribute_lib.py:1322 creator_with_resource_vars
        return self._create_variable(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/one_device_strategy.py:262 _create_variable
        return next_creator(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py:197 <lambda>
        previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variable_scope.py:2507 default_variable_creator
        shape=shape)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py:262 __call__
        return super(VariableMetaclass, cls).__call__(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1406 __init__
        distribute_strategy=distribute_strategy)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1537 _init_from_args
        initial_value() if init_from_fn else initial_value,
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer_utils.py:119 <lambda>
        init_val = lambda: initializer(shape, dtype=dtype)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/init_ops_v2.py:343 __call__
        self.stddev, dtype)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/init_ops_v2.py:809 truncated_normal
        shape=shape, mean=mean, stddev=stddev, dtype=dtype, seed=self.seed)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/random_ops.py:171 truncated_normal
        mean_tensor = ops.convert_to_tensor(mean, dtype=dtype, name="mean")
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1184 convert_to_tensor
        return convert_to_tensor_v2(value, dtype, preferred_dtype, name)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1242 convert_to_tensor_v2
        as_ref=False)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1296 internal_convert_to_tensor
        ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/tensor_conversion_registry.py:52 _default_conversion_function
        return constant_op.constant(value, dtype, name=name)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/constant_op.py:227 constant
        allow_broadcast=True)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/constant_op.py:235 _constant_impl
        t = convert_to_eager_tensor(value, ctx, dtype)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/constant_op.py:96 convert_to_eager_tensor
        return ops.EagerTensor(value, ctx.device_name, dtype)

    RuntimeError: /job:localhost/replica:0/task:0/device:GPU:0 unknown device.

SQuAD predictions fail

@kamalkraj Thank you for giving clear instructions for SQuAD, which weren't available in the main repo.

With the following parameters:

run_squad.py --mode=predict \
--albert_config_file=../albert_base_resources/config.json \
--model_dir=../albert_base_resources/ \
--input_meta_data_path=../squad_out_v1.1/squad_v1.1_meta_data \
--predict_file=../squad_dataset/dev_small.json.txt \
--spm_model_file=../albert_base_resources/vocab/30k-clean.model 

dev_small.json.txt

Fails with a "referenced before assignment" error.

So, shouldn't these lines:

ALBERT-TF2.0/run_squad.py

Lines 565 to 567 in 8d0cc21

if FLAGS.version_2_with_negative:
get_raw_results = get_raw_results_v2
for result in get_raw_results(predictions):

be modified to:

        if FLAGS.version_2_with_negative:
            predicted = get_raw_results_v2(predictions)
        else:
            predicted = get_raw_results(predictions)
        for result in predicted:

Weights

Could you give some more info about the weights linked here? Were they trained on an English corpus only, as in the original article?

You write that the last layers are not available. That would probably mean the weights cannot be used for additional domain-specific pre-training, right? What would be required to do this?

tf.saved_model.save and predict a single value

To save the model, I added this line after the training loop:
tf.saved_model.save(model, os.path.join(FLAGS.output_dir, "1"))
which produces assets, saved_model.pb, and variables.

From there, I load the model and try to predict a single value:

loaded = tf.saved_model.load( os.path.join(model_dir, "1") )

tokenizer = tokenization.FullTokenizer(vocab_file=None,spm_model_file=spm_model_file, do_lower_case=True)

text_a = "the movie was not good"
example = classifier_data_lib.InputExample(guid=0, text_a=text_a, text_b=None, label=0)

labels = [0, 1]
max_seq_length = 128

feature = classifier_data_lib.convert_single_example(ex_index=0, example=example, label_list=labels, max_seq_length=max_seq_length, tokenizer=tokenizer)

test_input_word_ids =tf.convert_to_tensor([feature.input_ids], dtype=tf.int32, name='input_word_ids')
test_input_mask     =tf.convert_to_tensor([feature.input_mask], dtype=tf.int32, name='input_mask')
test_input_type_ids =tf.convert_to_tensor([feature.segment_ids], dtype=tf.int32, name='input_type_ids')

logit = loaded.signatures["serving_default"]( input_mask=test_input_mask,input_type_ids=test_input_type_ids,input_word_ids=test_input_word_ids )

pred = tf.argmax(logit['output'], axis=-1, output_type=tf.int32)
prob = tf.nn.softmax(logit['output'], axis=-1)

print(f'Prediction: {pred} Probabilities: {prob}')

This solution works for a single value. Thanks

training parameter

When you ran SQuAD 2.0 training, you used the default value for "version_2_with_negative", which is false. Shouldn't it be set to "True" instead?

Fine Tune Error

Running the CoLA task returns:

FATAL Flags parsing error: flag --classification_task_name=CoLA: value should be one of <COLA|STS|SST|MNLI|QNLI|QQP|RTE|MRPC|WNLI|XNLI>

How to continue the finetuning?

Hi,

I have fine-tuned the base_2 model on SQuAD 2.0 for 3 epochs. Now I would like to continue training for several more epochs, but when I run the training command, the training process ends immediately.
What option should I add to continue the finetuning?

Thanks!

Pre-training using GPUs is strange

I am trying to pre-train from scratch on GPUs with Japanese data, but the pre-training looks strange.
In the following log, masked_lm_accuracy and sentence_order_accuracy suddenly dropped.

..
I1211 00:37:45.981178 139995264022336 model_training_utils.py:346] Train Step: 45595/273570  / loss = 0.8961147665977478  masked_lm_accuracy = 0.397345  lm_example_loss = 2.636538  sentence_order_accuracy = 0.772450  sentence_order_mean_loss = 0.425534
I1211 14:28:47.512063 139995264022336 model_training_utils.py:346] Train Step: 91190/273570  / loss = 0.7142021656036377  masked_lm_accuracy = 0.454914  lm_example_loss = 2.074183  sentence_order_accuracy = 0.810986  sentence_order_mean_loss = 0.372746
I1212 04:19:05.215945 139995264022336 model_training_utils.py:346] Train Step: 136785/273570  / loss = 1.9355322122573853  masked_lm_accuracy = 0.062883  lm_example_loss = 5.900585  sentence_order_accuracy = 0.572066  sentence_order_mean_loss = 0.668080
..

Has someone succeeded in pre-training from scratch?

Unused weights from saved model

I ran converter.py to convert the ALBERT TensorFlow Hub model to a TF 2.0 model with the following commands:

MODEL_DIR=albert-base
SIZE=base

# Converting weights to TF 2.0
python converter.py --tf_hub_path=${MODEL_DIR}/ --model_type=albert_encoder --version=2 --model=${SIZE}
# Copy albert_config.json to config.json
cp ${MODEL_DIR}/assets/albert_config.json ${MODEL_DIR}/config.json
# Rename assets to vocab
mv ${MODEL_DIR}/assets/ ${MODEL_DIR}/vocab

However, at the end of the conversion, it shows the following messages:

Done loading 25 ALBERT weights from: pretrain/albert-base-v2// into <albert.AlbertModel object at 0x7f393e172b00> (prefix:albert). Count of weights not found in the checkpoint was: [0]. Count of weights with mismatched shape: [0]
Unused weights from saved model: 
	cls/predictions/output_bias
	cls/predictions/transform/LayerNorm/beta
	cls/predictions/transform/LayerNorm/gamma
	cls/predictions/transform/dense/bias
	cls/predictions/transform/dense/kernel

Does this message mean the conversion succeeded?

Training from scratch in another language

I want to train ALBERT from scratch in a non-English language. I have access to a corpus of 1-2 B words. Would that be sufficient?

Would training on a single Cloud TPU v3 with 128 GB RAM be feasible? Can you give an estimated training time for base, large, and xlarge?

do_predict?

Hi,

Can the script do prediction? I may have missed it, but I didn't see a "do_predict" flag.

Gradients do not exist for variables

I tried to run run_classifer.py; it works well on GPU, and after a small fix it also runs well on TPU. However, when I tried to run run_squad.py, I hit this bug on both GPU and TPU:

Model was constructed with shape Tensor("unique_ids:0", shape=(None, 1), dtype=int32) for input (None, 1), but it was re-called on a Tensor with incompatible shape (None,).
WARNING:tensorflow:Gradients do not exist for variables ['albert_model/pooler_transform/kernel:0', 'albert_model/pooler_transform/bias:0'] when minimizing the loss.
W1211 11:49:55.828053 139666686506368 optimizer_v2.py:1043] Gradients do not exist for variables ['albert_model/pooler_transform/kernel:0', 'albert_model/pooler_transform/bias:0'] when minimizing the loss.
WARNING:tensorflow:Model was constructed with shape Tensor("unique_ids:0", shape=(None, 1), dtype=int32) for input (None, 1), but it was re-called on a Tensor with incompatible shape (None,).
W1211 11:49:59.275795 139666686506368 network.py:847] Model was constructed with shape Tensor("unique_ids:0", shape=(None, 1), dtype=int32) for input (None, 1), but it was re-called on a Tensor with incompatible shape (None,).
WARNING:tensorflow:Gradients do not exist for variables ['albert_model/pooler_transform/kernel:0', 'albert_model/pooler_transform/bias:0'] when minimizing the loss.
W1211 11:50:02.947960 139666686506368 optimizer_v2.py:1043] Gradients do not exist for variables ['albert_model/pooler_transform/kernel:0', 'albert_model/pooler_transform/bias:0'] when minimizing the loss.

Getting a huge number of training steps

I have generated the pretraining data using the steps given in this repo.
I am doing this for the Hindi language with 22 GB of data; generating the pretraining data alone took 1 month!
Each tf_record file has an associated meta_data file. I added up the train_data_size values from all the meta_data files into one meta_data file, because run_pretraining.py requires it. My final meta_data file looks something like this:

{
    "task_type": "albert_pretraining",
    "train_data_size": 596972848,
    "max_seq_length": 512,
    "max_predictions_per_seq": 20
}

Here the number of training steps is calculated as below:

num_train_steps = int(total_train_examples / train_batch_size) * num_train_epochs

So total_train_examples is 596972848, hence I get num_train_steps = 9327700 with a batch size of 64 and only 1 epoch. I saw that in the readme here num_train_steps=125000. I don't understand what went wrong here.
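Plugging the numbers above into that formula reproduces both step counts, so the arithmetic itself is consistent:

    # num_train_steps = int(total_train_examples / train_batch_size) * num_train_epochs
    total_train_examples = 596972848
    print(int(total_train_examples / 64) * 1)    # -> 9327700  (batch size 64, 1 epoch)
    print(int(total_train_examples / 512) * 1)   # -> 1165962  (batch size 512, 1 epoch)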

With so many training steps, it will take forever to train ALBERT. Even if I increase the batch size to 512 with only 1 epoch, the number of training steps will be 1165962, which is still huge!
Since ALBERT was trained on a very large corpus, why are there only 125000 steps? I would also like to know how many epochs were used for the English ALBERT training.

Can anyone suggest what went wrong and what I should do now?

AssertionError

Hi @kamalkraj, thank you for the previous fix.
I am working on the STS-B data set and I am executing the following commands on Ubuntu:

export GLUE_DIR=glue_data
export ALBERT_DIR=model_configs/large
export TASK_NAME=STS
export OUTPUT_DIR=stsb_processed
mkdir $OUTPUT_DIR
export MODEL_DIR=output_stsb

python run_classifer.py \
--train_data_path=${OUTPUT_DIR}/${TASK_NAME}_train.tf_record \
--eval_data_path=${OUTPUT_DIR}/${TASK_NAME}_eval.tf_record \
--input_meta_data_path=${OUTPUT_DIR}/${TASK_NAME}_meta_data \
--albert_config_file=${ALBERT_DIR}/config.json \
--task_name=${TASK_NAME} \
--spm_model_file=${ALBERT_DIR}/vocab/30k-clean.model \
--output_dir=${MODEL_DIR} \
--init_checkpoint=${ALBERT_DIR}/tf2_model.h5 \
--do_train \
--do_eval \
--train_batch_size=16 \
--learning_rate=1e-5 \
--custom_training_loop

Error message :

I1209 13:14:37.739436 140685254485824 run_classifer.py:306] ***** Running training *****
I1209 13:14:37.739539 140685254485824 run_classifer.py:307] Num examples = 5749
I1209 13:14:37.739591 140685254485824 run_classifer.py:308] Batch size = 16
I1209 13:14:37.739633 140685254485824 run_classifer.py:309] Num steps = 1077
Traceback (most recent call last):
  File "run_classifer.py", line 452, in <module>
    app.run(main)
  File "/home/vv/venvv/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/vv/venvv/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "run_classifer.py", line 355, in main
    custom_callbacks = custom_callbacks)
  File "/home/vv/git/ALBERT-TF2.0/model_training_utils.py", line 155, in run_customized_training_loop
    assert tf.executing_eagerly()
AssertionError

Any idea on the same?

KeyError: 5.0

Hi, I am working on the STS-B data set and I am executing the following commands on Ubuntu:

export GLUE_DIR=glue_data
export ALBERT_DIR=model_configs/large

export TASK_NAME=STS
export OUTPUT_DIR=stsb_processed
mkdir $OUTPUT_DIR

python create_finetuning_data.py \
--input_data_dir=${GLUE_DIR}/ \
--spm_model_file=${ALBERT_DIR}/vocab/30k-clean.model \
--train_data_output_path=${OUTPUT_DIR}/${TASK_NAME}_train.tf_record \
--eval_data_output_path=${OUTPUT_DIR}/${TASK_NAME}_eval.tf_record \
--meta_data_file_path=${OUTPUT_DIR}/${TASK_NAME}_meta_data \
--fine_tuning_task_type=classification --max_seq_length=128 \
--classification_task_name=${TASK_NAME}

Error message:

I1206 14:39:44.645808 139799230306112 classifier_data_lib.py:761] Writing example 0 of 5749
Traceback (most recent call last):
  File "create_finetuning_data.py", line 149, in <module>
    app.run(main)
  File "/home/chirag/venv/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/chirag/venv/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "create_finetuning_data.py", line 137, in main
    input_meta_data = generate_classifier_dataset()
  File "create_finetuning_data.py", line 122, in generate_classifier_dataset
    do_lower_case=FLAGS.do_lower_case)
  File "/home/chirag/git/ALBERT-TF2.0/classifier_data_lib.py", line 835, in generate_tf_record_from_data_file
    train_data_output_path)
  File "/home/chirag/git/ALBERT-TF2.0/classifier_data_lib.py", line 764, in file_based_convert_examples_to_features
    max_seq_length, tokenizer)
  File "/home/chirag/git/ALBERT-TF2.0/classifier_data_lib.py", line 732, in convert_single_example
    label_id = label_map[example.label]
KeyError: 5.0

Any idea why I am getting this KeyError? Thanks in advance.

Resuming training from tf checkpoints

After training a model for some epochs, how can I restore it and continue training from the checkpoints it writes, given that they are not in HDF5 format?
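A generic TF 2.0 pattern for this (a sketch only; the repo's training loop may already restore the latest checkpoint on its own) is to wrap the same model and optimizer objects in a tf.train.Checkpoint and restore the newest file from the model directory before resuming the loop:

    import tensorflow as tf

    # `model`, `optimizer` and `model_dir` are placeholders: build them exactly as in
    # the original training run, then point `model_dir` at the checkpoint directory.
    checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)
    latest = tf.train.latest_checkpoint(model_dir)   # e.g. ".../ctl_step_31250.ckpt-1"
    if latest:
        checkpoint.restore(latest)
    # ...continue the training loop from here with the restored weights and optimizer state.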

Pretraining of albert from scratch is stuck

I am doing pre-training from scratch. Training seems to have started, since the GPUs are being used, but nothing appears in the terminal except this:

***** Number of cores used :  4 
I0227 09:00:31.841020 140137372948224 run_pretraining.py:226] Training using customized training loop TF 2.0 with distrubutedstrategy.
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',). 
I0227 09:00:44.563593 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',). 
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:44.569019 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',). 
I0227 09:00:45.620952 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',). 
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:45.625989 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:46.679141 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:46.684157 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:47.734523 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:47.739573 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:57.697876 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:57.703157 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:batch_all_reduce: 32 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
I0227 09:01:07.835676 140137372948224 cross_device_ops.py:748] batch_all_reduce: 32 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:batch_all_reduce: 32 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
I0227 09:01:28.672055 140137372948224 cross_device_ops.py:748] batch_all_reduce: 32 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
2020-02-27 09:01:50.162839: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0

I also tried with smaller text data, but got the same results.
@kamalkraj

How to do online inference with TensorFlow 2.0?

I am trying to do online inference with TensorFlow 2.0. My code is as follows:

    self.graph = tf.Graph()

    with self.graph.as_default() as g:
        self.input_ids = tf.compat.v1.placeholder(tf.int32, [FLAGS.batch_size,
                                                             FLAGS.max_seq_length], name="input_ids")
        self.input_mask = tf.compat.v1.placeholder(tf.int32, [FLAGS.batch_size,
                                                              FLAGS.max_seq_length], name="input_mask")
        self.p_mask = tf.compat.v1.placeholder(tf.float32, [FLAGS.batch_size,
                                                            FLAGS.max_seq_length], name="p_mask")
        self.segment_ids = tf.compat.v1.placeholder(tf.int32, [FLAGS.batch_size,
                                                               FLAGS.max_seq_length], name="segment_ids")
        self.cls_index = tf.compat.v1.placeholder(tf.int32, [FLAGS.batch_size], name="segment_ids")
        self.unique_ids = tf.compat.v1.placeholder(tf.int32, [FLAGS.batch_size], name="unique_ids")

        # unpacked_inputs = tf_utils.unpack_inputs(inputs)
        self.squad_model = ALBertQAModel(
            albert_config, FLAGS.max_seq_length, init_checkpoint, FLAGS.start_n_top, FLAGS.end_n_top,
            FLAGS.squad_dropout)

        learning_rate_fn = tf.keras.optimizers.schedules.PolynomialDecay(initial_learning_rate=1e-5,
                                                                         decay_steps=10000,
                                                                         end_learning_rate=0.0)
        optimizer_fn = AdamWeightDecay
        optimizer = optimizer_fn(
            learning_rate=learning_rate_fn,
            weight_decay_rate=0.01,
            beta_1=0.9,
            beta_2=0.999,
            epsilon=1e-6,
            exclude_from_weight_decay=['layer_norm', 'bias'])

        self.squad_model.optimizer = optimizer
        graph_init_op = tf.compat.v1.global_variables_initializer()

        y = self.squad_model(
            self.unique_ids, self.input_ids, self.input_mask, self.segment_ids, self.cls_index,
            self.p_mask, training=False)
        self.unique_ids, self.start_tlp, self.start_ti, self.end_tlp, self.end_ti, self.cls_logits = y

        self.sess = tf.compat.v1.Session(graph=self.graph, config=gpu_config)
        self.sess.run(graph_init_op)
        with self.sess.as_default() as sess:
            self.squad_model.load_weights(FLAGS.model_dir)

This code runs, but it gives bad results. It looks like the parameters are not loaded. I guess this is probably because I'm not restoring the parameters into the model through a tf.Session, e.g. with saver.restore(sess, tf.train.latest_checkpoint(init_checkpoint)).
I've tried several ways to do this, but none has worked. There are also very few examples of online inference with TensorFlow 2.0 on the Internet, and I'm having trouble finding a solution. :((((
May I get some help here? Thanks very much!
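In TF 2.0 the usual pattern is to drop the Graph/Session machinery entirely: build the Keras model, load its weights, and call it eagerly on tensors. A rough sketch reusing the names from the code above (argument order and the load call mirror that snippet, so treat the details as assumptions):

    import tensorflow as tf

    # Build the model eagerly -- no placeholders, Graph or Session needed in TF 2.0.
    squad_model = ALBertQAModel(
        albert_config, FLAGS.max_seq_length, init_checkpoint,
        FLAGS.start_n_top, FLAGS.end_n_top, FLAGS.squad_dropout)
    squad_model.load_weights(FLAGS.model_dir)  # same load call as in the snippet above

    # Inputs are ordinary int32/float32 tensors built from a tokenized example,
    # not placeholders; the model is simply called on them.
    outputs = squad_model(unique_ids, input_ids, input_mask, segment_ids,
                          cls_index, p_mask, training=False)
    unique_ids, start_tlp, start_ti, end_tlp, end_ti, cls_logits = outputs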

ALBERT model not learning

Hi there, I am having some issues getting the model to fine-tune.

I'm sort of confused and could use some help. Is there a forum I could ask for help?

The issue is that the model doesn't learn; it just stays at ~0.5 accuracy. (N.B. the output is a 2-class dense layer.)

Here's a sample output:

Layer (type)                  Output Shape            Param #    Connected to
input_word_ids (InputLayer)   [(None, 35)]            0
input_mask (InputLayer)       [(None, 35)]            0
input_type_ids (InputLayer)   [(None, 35)]            0
albert_model (AlbertModel)    [(None, 1024)], (None   17683968   input_word_ids[0][0]
                                                                 input_mask[0][0]
                                                                 input_type_ids[0][0]
dropout (Dropout)             (None, 1024)            0          albert_model[0][0]
output (Dense)                (None, 2)               2050       dropout[0][0]

Total params: 17,686,018
Trainable params: 17,686,018
Non-trainable params: 0

I0416 20:14:06.850114 140122845333248 finetune.py:186] ***** Running training *****
I0416 20:14:06.850288 140122845333248 finetune.py:187] Num examples = 52500
I0416 20:14:06.850376 140122845333248 finetune.py:188] Batch size = 32
I0416 20:14:06.850451 140122845333248 finetune.py:189] Num steps = 32812
Train on 47261 samples, validate on 5252 samples
Epoch 1/20
2020-04-16 20:14:41.742967: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
4064/47261 [=>............................] - ETA: 25:16 - loss: 0.8179 - sparse_categorical_accuracy: 0.4783

save model issue in TF 2.0 (tf.saved_model.save)

I am new to TF 2.0. I tried to save the model with "tf.saved_model.save(squad_m......", but I always get errors such as: "start_positions = inputs["start_positions"] KeyError: 'start_positions'". I am guessing this is because of the use of a subclassed Keras model ("class ALBertQAModel(tf.keras.Model):"); could you confirm, or help me understand if it is otherwise?
Thanks,
Jim
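One generic way around this kind of error with subclassed models (a sketch only, under the assumption that the KeyError comes from tracing the training code path, which expects labels) is to export an explicit inference-only signature instead of relying on the default call:

    import tensorflow as tf

    # Shapes, dtypes and argument names below are assumptions based on the QA model's
    # inference call shown in the issues above; adjust them to the actual model.
    @tf.function(input_signature=[
        tf.TensorSpec([None], tf.int32, name="unique_ids"),
        tf.TensorSpec([None, 384], tf.int32, name="input_ids"),
        tf.TensorSpec([None, 384], tf.int32, name="input_mask"),
        tf.TensorSpec([None, 384], tf.int32, name="segment_ids"),
        tf.TensorSpec([None], tf.int32, name="cls_index"),
        tf.TensorSpec([None, 384], tf.float32, name="p_mask"),
    ])
    def serve(unique_ids, input_ids, input_mask, segment_ids, cls_index, p_mask):
        outputs = squad_model(unique_ids, input_ids, input_mask, segment_ids,
                              cls_index, p_mask, training=False)
        # Name the outputs explicitly; the names here are placeholders.
        return {"output_%d" % i: t for i, t in enumerate(outputs)}

    tf.saved_model.save(squad_model, export_dir, signatures=serve)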

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[16,128,3072]

I have a 1660 Ti with 6 GB of memory, but when I check the GPU usage it is only at 2 to 4%. Can you tell me why this is happening, or is there a way I can make it use my GPU?

I am running the following command for the CoLA example:

python run_classifer.py \
--train_data_path=${OUTPUT_DIR}/${TASK_NAME}_train.tf_record \
--eval_data_path=${OUTPUT_DIR}/${TASK_NAME}_eval.tf_record \
--input_meta_data_path=${OUTPUT_DIR}/${TASK_NAME}_meta_data \
--albert_config_file=${ALBERT_DIR}/config.json \
--task_name=${TASK_NAME} \
--spm_model_file=${ALBERT_DIR}/vocab/30k-clean.model \
--output_dir=${MODEL_DIR} \
--init_checkpoint=${ALBERT_DIR}/tf2_model.h5 \
--do_train \
--do_eval \
--train_batch_size=16 \
--learning_rate=1e-5 \
--custom_training_loop

I have also created the tf2_model.h5 model files using this link [https://github.com/kamalkraj/ALBERT-TF2.0/blob/master/converter.md],

but I am still getting an OOM error. Can you help with this?


Limit:                  2312241152
InUse:                  2299682816
MaxInUse:               2299704576
NumAllocs:                    1254
MaxAllocSize:             31680256

2020-01-17 12:39:30.108357: W tensorflow/core/common_runtime/bfc_allocator.cc:424] ****************************************************************************************************
2020-01-17 12:39:30.108429: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at cwise_ops_common.cc:82 : Resource exhausted: OOM when allocating tensor with shape[16,128,3072] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
2020-01-17 12:39:30.108517: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Resource exhausted: OOM when allocating tensor with shape[16,128,3072] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node model_1/albert_model/encoder/shared_layer_10/intermediate/add}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Traceback (most recent call last):
  File "run_classifer.py", line 452, in <module>
    app.run(main)
  File "/media/xxxx/NewVolume/kamal/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/media/xxxx/NewVolume/kamal/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "run_classifer.py", line 355, in main
    custom_callbacks = custom_callbacks)
  File "/media/xxxx/NewVolume/ALBERT-TF2.0/model_training_utils.py", line 324, in run_customized_training_loop
    train_single_step(train_iterator)
  File "/media/xxxx/NewVolume/kamal/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 457, in __call__
    result = self._call(*args, **kwds)
  File "/media/xxxx/NewVolume/kamal/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 520, in _call
    return self._stateless_fn(*args, **kwds)
  File "/media/xxxx/NewVolume/kamal/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1823, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/media/xxxx/NewVolume/kamal/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1141, in _filtered_call
    self.captured_inputs)
  File "/media/xxxx/NewVolume/kamal/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1224, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager)
  File "/media/xxxx/NewVolume/kamal/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 511, in call
    ctx=ctx)
  File "/media/xxxx/NewVolume/kamal/lib/python3.6/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError:  OOM when allocating tensor with shape[16,128,3072] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node model_1/albert_model/encoder/shared_layer_10/intermediate/add (defined at /media/xxxx/NewVolume/kamal/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1751) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_train_single_step_24488]

Function call stack:
train_single_step
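For context, the failing tensor here is an intermediate-layer activation of shape [16, 128, 3072] (batch 16, sequence length 128), so running out of memory on a 6 GB card is plausible even when average utilisation looks low; the usual workaround is to reduce --train_batch_size and/or --max_seq_length until the model fits.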

Suggestion on Readme.MD

It might be nice to, as a final step, show an instance of an actual inference on the model so a reader can "tie it all together". It isn't strictly useful, but for anyone who doesn't know a lot of the terminology it would bring it home.
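For what such an example could look like, the "tf.saved_model.save and predict a single value" issue above already walks through one end-to-end path: tokenize a sentence with tokenization.FullTokenizer, build features with classifier_data_lib.convert_single_example, and feed the resulting input_word_ids / input_mask / input_type_ids tensors to the loaded model's serving_default signature.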

Pretraining from scratch

Hi,
Thanks for your code :) It's very helpful for studying ALBERT.
As far as I know, the ALBERT batch size is 4096 in the paper.
Have you ever tried to pretrain from scratch on GPUs?
I've seen your guide for SQuAD fine-tuning but couldn't find any information about pretraining from scratch.
Please let me know if you have any info on that.

Freezing layers during training

When training I see progress followed by degradation. This is likely because the model is overfitting due to the limited corpus size of 8k samples: we are overwriting the pre-trained weights during the fine-tuning task. What we would like to do is freeze the original layers; we need to figure out how to do this.
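A minimal way to try this with the Keras classifier model (a sketch; "albert_model" is the layer name shown in the Keras summary in the "ALBERT model not learning" issue above, and `model` plus the compile arguments are placeholders) is to mark the encoder layer non-trainable and recompile:

    # Freeze the shared ALBERT encoder so fine-tuning only updates the classification head.
    model.get_layer("albert_model").trainable = False

    # Changing `trainable` only takes effect after recompiling the model.
    model.compile(optimizer=optimizer, loss=loss_fn, metrics=metrics)
    model.summary()  # Trainable params should now exclude the encoder weights.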

Assertion error finetuning with run_classifier.py

I can't get past this error with run_classifier.py

AssertionError: Nothing except the root object matched a checkpointed value. Typically this means that the checkpoint does not match the Python program. The following objects have no matching checkpointed value: [MirroredVariable:{
0 /job:localhost/replica:0/task:0/device:GPU:0: <tf.Variable 'albert_model/encoder/shared_layer/self_attention/value/bias:0' shape=(1024,) dtype=float32, numpy=array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)> ...

Below is my call of the script. I am only testing the workflow so I pretrained for 1 epoch. I made a custom task for my particular use case.

ALBERT_CONFIG=$HOME/idbd-bio-dev/top-binner-albert/data/configs/config_10mers_tf2_2.json
EVAL=$HOME/mnt/corpuses/finetune_corpus_10mers_test/fine_tune_tf_records/eval.tfrecord
TRAIN=$HOME/mnt/corpuses/finetune_corpus_10mers_test/fine_tune_tf_records/training.tfrecord
META=$HOME/mnt/corpuses/finetune_corpus_10mers_test/fine_tune_tf_records/metadata.txt
OUTPUT_DIR=$HOME/mnt/models/albert_finetune_10mer_15_len
INIT_CHKPNT=$HOME/mnt/models/albert_pretrain_10mer_tf2_15_len/ctl_step_31250.ckpt-1
VOCAB=$HOME/mnt/vocab/10mers.vocab
SPM_MODEL=$HOME/mnt/vocab/10mers.model

export PYTHONPATH=$PYTHONPATH:../../albert_tf2
cd ../../albert_tf2

python run_classifer.py \
--albert_config_file=$ALBERT_CONFIG \
--eval_data_path=$EVAL \
--input_meta_data_path=$META \
--train_data_path=$TRAIN \
--strategy_type=mirror \
--output_dir=$OUTPUT_DIR \
--vocab_file=$VOCAB \
--spm_model_file=$SPM_MODEL \
--do_train=True \
--do_eval=True \
--do_predict=False \
--max_seq_length=15 \
--optimizer=AdamW \
--task_name=GENOMIC \
--train_batch_size=32 \
--init_checkpoint=$INIT_CHKPNT

AttributeError: 'AdamWeightDecay' object has no attribute '_decayed_lr_t'

Hi,

Thanks for the code!
I'm trying to run the code on the SQuAD 2.0 dataset. I used the Version 2 base model (not the xxlarge one). When I ran

python3 run_squad.py --mode=train_and_predict --input_meta_data_path=${OUTPUT_DIR}/squad_${SQUAD_VERSION}meta_data --train_data_path=${OUTPUT_DIR}/squad${SQUAD_VERSION}_train.tf_record --predict_file=${SQUAD_DIR}/dev-${SQUAD_VERSION}.json --albert_config_file=${ALBERT_DIR}/config.json --init_checkpoint=${ALBERT_DIR}/tf2_model.h5 --spm_model_file=${ALBERT_DIR}/vocab/30k-clean.model --train_batch_size=24 --predict_batch_size=24 --learning_rate=1.5e-5 --num_train_epochs=3 --model_dir=${OUTPUT_DIR} --strategy_type=mirror --version_2_with_negative --max_seq_length=384

An exception occurred,

AttributeError: 'AdamWeightDecay' object has no attribute '_decayed_lr_t'

I ran the code on Ubuntu 18.04, and my TensorFlow versions are as follows:

tb-nightly 1.14.0a20190603
tensorboard 1.14.0
tensorflow 2.0.0b1
tensorflow-estimator 1.14.0
tensorflow-gpu 2.0.0b1
tf-estimator-nightly 1.14.0.dev2019060501

Is there something wrong with the versions?
The detailed error information is as follows.

W1116 11:33:43.881644 140642252482304 optimizer_v2.py:979] Gradients does not exist for variables ['albert_model/pooler_transform/kernel:0', 'albert_model/pooler_transform/bias:0'] when minimizing the loss.
I1116 11:33:43.977660 140648276031296 coordinator.py:219] Error reported to Coordinator: 'AdamWeightDecay' object has no attribute '_decayed_lr_t'
Traceback (most recent call last):
  File "/home/cjy/.local/lib/python3.7/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/home/cjy/.local/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 189, in _call_for_each_replica
    **merge_kwargs)
  File "/home/cjy/.local/lib/python3.7/site-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py", line 476, in _distributed_apply
    var, apply_grad_to_update_var, args=(grad,), group=False))
  File "/home/cjy/.local/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1458, in update
    return self._update(var, fn, args, kwargs, group)
  File "/home/cjy/.local/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 766, in _update
    **values.select_device_mirrored(d, kwargs)))
  File "/home/cjy/.local/lib/python3.7/site-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py", line 460, in apply_grad_to_update_var
    grad.values, var, grad.indices)
  File "/home/cjy/.local/lib/python3.7/site-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py", line 850, in _resource_apply_sparse_duplicate_indices
    return self._resource_apply_sparse(summed_grad, handle, unique_indices)
  File "/home/cjy/Albert/ALBERT/optimization.py", line 168, in _resource_apply_sparse
    var.device, var.dtype.base_dtype, apply_state)
  File "/home/cjy/Albert/ALBERT/optimization.py", line 148, in _get_lr
    return self._decayed_lr_t[var_dtype], {}
  File "/home/cjy/.local/lib/python3.7/site-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py", line 542, in __getattribute__
    raise e
  File "/home/cjy/.local/lib/python3.7/site-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py", line 532, in __getattribute__
    return super(OptimizerV2, self).__getattribute__(name)
AttributeError: 'AdamWeightDecay' object has no attribute '_decayed_lr_t'
Traceback (most recent call last):
  File "run_squad.py", line 845, in <module>
    app.run(main)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "run_squad.py", line 837, in main
    train_squad(strategy, input_meta_data)
  File "run_squad.py", line 742, in train_squad
    custom_callbacks=custom_callbacks)
  File "/home/cjy/Albert/ALBERT/model_training_utils.py", line 328, in run_customized_training_loop
    tf.convert_to_tensor(steps, dtype=tf.int32))
  File "/home/cjy/.local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 416, in __call__
    self._initialize(args, kwds, add_initializers_to=initializer_map)
  File "/home/cjy/.local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 359, in _initialize
    *args, **kwds))
  File "/home/cjy/.local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1360, in _get_concrete_function_internal_garbage_collected
    graph_function, _, _ = self._maybe_define_function(args, kwargs)
  File "/home/cjy/.local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1648, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/home/cjy/.local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1541, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/home/cjy/.local/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 716, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/home/cjy/.local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 309, in wrapped_fn
    return weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/home/cjy/.local/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 706, in wrapper
    raise e.ag_error_metadata.to_exception(type(e))
AttributeError: in converted code:

/home/cjy/Albert/ALBERT/model_training_utils.py:239 train_steps  *
    strategy.experimental_run_v2(_replicated_step, args=(next(iterator),))
/home/cjy/.local/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py:708 experimental_run_v2
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
/home/cjy/.local/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py:1710 call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
/home/cjy/.local/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py:708 _call_for_each_replica
    fn, args, kwargs)
/home/cjy/.local/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py:195 _call_for_each_replica
    coord.join(threads)
/home/cjy/.local/lib/python3.7/site-packages/tensorflow/python/training/coordinator.py:389 join
    six.reraise(*self._exc_info_to_raise)
/usr/lib/python3/dist-packages/six.py:693 reraise
    raise value
/home/cjy/.local/lib/python3.7/site-packages/tensorflow/python/training/coordinator.py:297 stop_on_exception
    yield
/home/cjy/.local/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py:189 _call_for_each_replica
    **merge_kwargs)
/home/cjy/.local/lib/python3.7/site-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:476 _distributed_apply
    var, apply_grad_to_update_var, args=(grad,), group=False))
/home/cjy/.local/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py:1458 update
    return self._update(var, fn, args, kwargs, group)
/home/cjy/.local/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py:766 _update
    **values.select_device_mirrored(d, kwargs)))
/home/cjy/.local/lib/python3.7/site-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:460 apply_grad_to_update_var
    grad.values, var, grad.indices)
/home/cjy/.local/lib/python3.7/site-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:850 _resource_apply_sparse_duplicate_indices
    return self._resource_apply_sparse(summed_grad, handle, unique_indices)
/home/cjy/Albert/ALBERT/optimization.py:168 _resource_apply_sparse
    var.device, var.dtype.base_dtype, apply_state)
/home/cjy/Albert/ALBERT/optimization.py:148 _get_lr
    return self._decayed_lr_t[var_dtype], {}
/home/cjy/.local/lib/python3.7/site-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:542 __getattribute__
    raise e
/home/cjy/.local/lib/python3.7/site-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:532 __getattribute__
    return super(OptimizerV2, self).__getattribute__(name)

AttributeError: 'AdamWeightDecay' object has no attribute '_decayed_lr_t'
