bigbird's Introduction

Big Bird: Transformers for Longer Sequences

Not an official Google product.

What is BigBird?

BigBird is a sparse-attention-based transformer that extends Transformer-based models, such as BERT, to much longer sequences. Moreover, BigBird comes with a theoretical understanding of the capabilities of a complete transformer that the sparse model can handle.

As a consequence of the capability to handle longer context, BigBird drastically improves performance on various NLP tasks such as question answering and summarization.

More details and comparisons can be found in our presentation.

Citation

If you find this useful, please cite our NeurIPS 2020 paper:

@article{zaheer2020bigbird,
  title={Big bird: Transformers for longer sequences},
  author={Zaheer, Manzil and Guruganesh, Guru and Dubey, Kumar Avinava and Ainslie, Joshua and Alberti, Chris and Ontanon, Santiago and Pham, Philip and Ravula, Anirudh and Wang, Qifan and Yang, Li and others},
  journal={Advances in Neural Information Processing Systems},
  volume={33},
  year={2020}
}

Code

The most important directory is core. There are three main files in core.

  • attention.py: Contains BigBird linear attention mechanism
  • encoder.py: Contains the main long sequence encoder stack
  • modeling.py: Contains packaged BERT and seq2seq transformer models with BigBird attention

Colab/IPython Notebook

A quick fine-tuning demonstration for text classification is provided in imdb.ipynb

Create GCP Instance

Please create a project first, then create an instance in a zone that has quota, as follows:

gcloud compute instances create \
  bigbird \
  --zone=europe-west4-a \
  --machine-type=n1-standard-16 \
  --boot-disk-size=50GB \
  --image-project=ml-images \
  --image-family=tf-2-3-1 \
  --maintenance-policy TERMINATE \
  --restart-on-failure \
  --scopes=cloud-platform

gcloud compute tpus create \
  bigbird \
  --zone=europe-west4-a \
  --accelerator-type=v3-32 \
  --version=2.3.1

gcloud compute ssh --zone "europe-west4-a" "bigbird"

For illustration we used the instance name bigbird and the zone europe-west4-a, but feel free to change them. More details about creating a Google Cloud TPU can be found in the online documentation.

Installation and checkpoints

git clone https://github.com/google-research/bigbird.git
cd bigbird
pip3 install -e .

You can find pretrained and fine-tuned checkpoints in our Google Cloud Storage Bucket.

Optionally, you can download them using gsutil as

mkdir -p bigbird/ckpt
gsutil cp -r gs://bigbird-transformer/ bigbird/ckpt/

The storage bucket contains:

  • pretrained BERT models in base (bigbr_base) and large (bigbr_large) sizes. They correspond to BERT/RoBERTa-like encoder-only models. Following the original BERT and RoBERTa implementations, they are transformers with post-normalization, i.e. layer norm is applied after the attention layer. However, following Rothe et al., we can use them partially in an encoder-decoder fashion by coupling the encoder and decoder parameters, as illustrated in the bigbird/summarization/roberta_base.sh launch script.
  • pretrained Pegasus encoder-decoder transformer in large size (bigbp_large). Again following the original implementation of Pegasus, they are transformers with pre-normalization. They have a full set of separate encoder-decoder weights. Also, for long document summarization datasets, we have converted Pegasus checkpoints (model.ckpt-0) for each dataset and also provided fine-tuned checkpoints (model.ckpt-300000) which work on longer documents.
  • fine-tuned tf.SavedModel for long document summarization, which can be used directly for prediction and evaluation as illustrated in the colab notebook.

Running Classification

To get started quickly with BigBird, one can run the classification experiment code in the classifier directory. To run the code, simply execute

export GCP_PROJECT_NAME=bigbird-project  # Replace by your project name
export GCP_EXP_BUCKET=gs://bigbird-transformer-training/  # Replace
sh -x bigbird/classifier/base_size.sh

Using BigBird Encoder instead of BERT/RoBERTa

To directly use the encoder instead of, say, a BERT model, we can use the following code.

from bigbird.core import modeling

bigb_encoder = modeling.BertModel(...)

It can easily replace BERT's encoder.

Alternatively, one can also try playing with the layers of the BigBird encoder

from bigbird.core import encoder

only_layers = encoder.EncoderStack(...)

Understanding Flags & Config

All the flags and config are explained in core/flags.py. Here we explain some of the important config parameters.

attention_type is used to select the type of attention we would use. Setting it to block_sparse runs the BigBird attention module.

flags.DEFINE_enum(
    "attention_type", "block_sparse",
    ["original_full", "simulated_sparse", "block_sparse"],
    "Selecting attention implementation. "
    "'original_full': full attention from original bert. "
    "'simulated_sparse': simulated sparse attention. "
    "'block_sparse': blocked implementation of sparse attention.")

block_size is used to define the size of blocks, whereas num_rand_blocks is used to set the number of random blocks. The code currently uses a window size of 3 blocks and 2 global blocks. The current code only supports static tensors.
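To make the roles of block_size and num_rand_blocks concrete, here is a minimal sketch in plain Python of which key blocks each query block would attend to. It is not the repository's implementation: the function names, the choice of the first and last blocks as the two global blocks, the uniform random sampling, and the requirement that the sequence length be a multiple of block_size are all assumptions made for illustration.

```python
import random


def block_sparse_pattern(seq_length, block_size, num_rand_blocks, seed=0):
    """For each query block, return the sorted key blocks it attends to.

    Illustrative only: assumes global blocks are the first and last block
    and that random blocks are sampled uniformly without replacement.
    """
    if seq_length % block_size != 0:
        # Assumption of this sketch, chosen to keep the block arithmetic simple.
        raise ValueError("seq_length must be a multiple of block_size")
    num_blocks = seq_length // block_size
    rng = random.Random(seed)
    global_blocks = {0, num_blocks - 1}

    pattern = []
    for i in range(num_blocks):
        if i in global_blocks:
            # Global blocks attend to every block.
            pattern.append(list(range(num_blocks)))
            continue
        # Sliding window of 3 blocks: left neighbour, self, right neighbour.
        window = {j for j in (i - 1, i, i + 1) if 0 <= j < num_blocks}
        attended = global_blocks | window
        # Random blocks are drawn from the blocks not already covered.
        candidates = [j for j in range(num_blocks) if j not in attended]
        picked = rng.sample(candidates, min(num_rand_blocks, len(candidates)))
        pattern.append(sorted(attended | set(picked)))
    return pattern


def keys_per_query(block_size, num_rand_blocks, window_blocks=3, global_blocks=2):
    """Upper bound on keys seen by one query token in a non-global block."""
    return (window_blocks + global_blocks + num_rand_blocks) * block_size


pattern = block_sparse_pattern(seq_length=4096, block_size=64, num_rand_blocks=3)
print(len(pattern), len(pattern[10]), keys_per_query(64, 3))
```

Under these assumptions, with block_size=64 and num_rand_blocks=3 a query in an interior block sees at most (3 + 2 + 3) * 64 = 512 keys, compared with 4096 keys under full attention for a 4096-token input.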

Important points to note:

  • Hidden dimension should be divisible by the number of heads.
  • Currently the code only handles tensors of static shape, as it is primarily designed for TPUs, which only work with statically shaped tensors.
  • For sequence lengths less than 1024, using original_full is advised as there is no benefit in using sparse BigBird attention.
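The first and last points above can be checked before launching a run. The helper below is a hypothetical sketch, not part of the BigBird code base: the function name and the short_sequence_threshold parameter are assumptions, while the argument names mirror the command-line flags used elsewhere in this document.

```python
def check_attention_config(hidden_size, num_attention_heads, max_encoder_length,
                           short_sequence_threshold=1024):
    """Validate head divisibility and suggest an attention_type value.

    Hypothetical helper for illustration; returns (head_size, attention_type).
    """
    if hidden_size % num_attention_heads != 0:
        raise ValueError(
            "hidden_size (%d) must be divisible by num_attention_heads (%d)"
            % (hidden_size, num_attention_heads))
    head_size = hidden_size // num_attention_heads

    # Below the threshold, sparse attention is not expected to help.
    if max_encoder_length < short_sequence_threshold:
        attention_type = "original_full"
    else:
        attention_type = "block_sparse"
    return head_size, attention_type


print(check_attention_config(768, 12, 3072))  # (64, 'block_sparse')
print(check_attention_config(768, 12, 512))   # (64, 'original_full')
```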

Comparisons

Recently, Long Range Arena provided a benchmark of six tasks that require longer context, and performed experiments to benchmark all existing long range transformers. The results are shown below. The BigBird model, unlike its counterparts, clearly reduces memory consumption without sacrificing performance.

bigbird's People

Contributors

hori-ryota, ikarosilva, manzilz

bigbird's Issues

Differences between ETC and BigBird-ETC version

@manzilz Thank you for sharing the excellent research. :)

I have two quick questions. If I missed some info in your paper, could you please let me know what I missed?

Q1. Is the global-local attention method used in the BigBird-ETC version exactly the same as in the ETC paper, or is it closer to Longformer?
As far as I know, according to the ETC paper some special tokens (global tokens) attend fully only to restricted sequences. For example, in the HotpotQA task, a paragraph token attends to all tokens within the paragraph. Also, a sentence token attends to all tokens within the sentence. (I can't find how the [CLS] and question tokens attend.)

In Longformer, the special tokens between sentences attend fully to the context.

In the BigBird paper (just above Section 3), the authors say

"we add g global tokens that attend to all existing tokens."

This seems to say that the BigBird-ETC version is similar to Longformer. However, when the authors describe the differences between Longformer and BigBird-ETC, they point to ETC as the reference (in Appendix E.3). This confuses me.

Q2. Is there source code or a pre-trained model for the BigBird-ETC version? If you could share the one used in your paper, I would really appreciate it!

I look forward to your response.

Pre-trained model for genomic sequences

Good morning,

Thank you for sharing the paper, code and pre-trained model for NLP text data. Your research results are impressive. Because I am developing embedding solutions for genes and proteins, the application to genomic sequences interests me the most.

Is there any chance to try the BigBird nucleotide-based pre-trained model for research purposes? I would like to include it in my benchmark and compare it with existing non-contextual embeddings (Word2Vec, FastText and GloVe).

Regards,
Piotr

Error in PubMed evaluation using run_summarization.py

I am using the script roberta_base.sh to train and test the model on the PubMed summarization task. I am able to successfully train the model for multiple steps (5000), but it fails at evaluation time. Below is part of the error output.

I0416 18:16:41.567906 139788890330944 error_handling.py:115] evaluation_loop marked as finished
WARNING:tensorflow:Reraising captured error
W0416 18:16:41.568143 139788890330944 error_handling.py:149] Reraising captured error
Traceback (most recent call last):
  File "bigbird/summarization/run_summarization.py", line 534, in <module>
    app.run(main)
...
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2268, in create_tpu_hostcall
    'dimension, but got scalar {}'.format(dequeue_ops[i][0]))
RuntimeError: All tensors outfed from TPU should preserve batch size dimension, but got scalar Tensor("OutfeedDequeueTuple:0", shape=(), dtype=float32, device=/job:worker/task:0/device:CPU:0)

I am not too familiar with the code or with this error. I searched for it online but didn't find much help. I hope you can help. Below is the script I ran to reproduce this error:

python3 bigbird/summarization/run_summarization.py \
  --data_dir="tfds://scientific_papers/pubmed" \
  --output_dir=gs://bigbird-replication-bucket/summarization/pubmed \
  --attention_type=block_sparse \
  --couple_encoder_decoder=True \
  --max_encoder_length=3072 \
  --max_decoder_length=256 \
  --num_attention_heads=12 \
  --num_hidden_layers=12 \
  --hidden_size=768 \
  --intermediate_size=3072 \
  --block_size=64 \
  --train_batch_size=2 \
  --eval_batch_size=4 \
  --num_train_steps=1000 \
  --do_train=True \
  --do_eval=True \
  --use_tpu=True \
  --tpu_name=bigbird \
  --tpu_zone=us-central1-b \
  --gcp_project=bigbird-replication \
  --num_tpu_cores=8 \
  --save_checkpoints_steps=1000 \
  --init_checkpoint=gs://bigbird-transformer/pretrain/bigbr_base/model.ckpt-0

Pegasus variables mapping

I have my own pretrained Pegasus model and now want to fine-tune it using BigBird. This is my mapping function:

import collections
import re

import tensorflow as tf


def get_assignment_map_from_checkpoint(tvars, init_checkpoint):
    """Compute the union of the current variables and checkpoint variables."""
    initialized_variable_names = {}

    # Index the model's variables by name, stripping the ':0' output suffix.
    name_to_variable = collections.OrderedDict()
    for var in tvars:
        name = var.name
        m = re.match(r'^(.*):\d+$', name)
        if m is not None:
            name = m.group(1)
        name_to_variable[name] = var

    # Map each checkpoint variable name to the matching model variable.
    assignment_map = collections.OrderedDict()
    for name, _ in tf.train.list_variables(init_checkpoint):
        # Order matters: specific patterns are replaced before general ones.
        mapped = 'pegasus/' + name
        mapped = mapped.replace('embeddings/weights', 'embeddings/word_embeddings')
        mapped = mapped.replace('self/output', 'output')
        mapped = mapped.replace('ffn/dense_1', 'output/dense')
        mapped = mapped.replace('ffn', 'intermediate')
        mapped = mapped.replace('memory_attention/output', 'attention/encdec_output')
        mapped = mapped.replace('memory_attention', 'attention/encdec')

        # Skip checkpoint variables that have no counterpart in the model.
        if mapped not in name_to_variable:
            continue
        assignment_map[name] = name_to_variable[mapped]
        initialized_variable_names[mapped + ':0'] = 1

    return (assignment_map, initialized_variable_names)
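The renaming rules in the function above can be exercised without TensorFlow or a checkpoint. The following is a small stand-alone sketch of just the string rewriting; RENAME_RULES and to_model_name are names introduced here for illustration and do not exist in the BigBird or Pegasus code.

```python
# Ordered (old, new) substring rules copied from the mapping function.
# Order matters: 'ffn/dense_1' must be handled before 'ffn', and
# 'memory_attention/output' before 'memory_attention'.
RENAME_RULES = [
    ("embeddings/weights", "embeddings/word_embeddings"),
    ("self/output", "output"),
    ("ffn/dense_1", "output/dense"),
    ("ffn", "intermediate"),
    ("memory_attention/output", "attention/encdec_output"),
    ("memory_attention", "attention/encdec"),
]


def to_model_name(checkpoint_name, scope="pegasus/"):
    """Translate a checkpoint variable name into the model's variable name."""
    name = scope + checkpoint_name
    for old, new in RENAME_RULES:
        name = name.replace(old, new)
    return name


for ckpt_name in [
    "embeddings/weights",
    "decoder/layer_0/attention/self/output/dense/kernel",
    "decoder/layer_0/ffn/dense_1/kernel",
    "decoder/layer_0/memory_attention/output/dense/kernel",
]:
    print(ckpt_name, "->", to_model_name(ckpt_name))
```

Checking the rules in isolation like this makes it easier to spot checkpoint variables that would be silently skipped because their rewritten name does not match any model variable.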

Output of the mapping function (the resulting assignment_map):

OrderedDict([('decoder/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_0/attention/self/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_0/attention/self/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_0/attention/self/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_0/attention/self/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_0/attention/self/key/kernel',
              <tf.Variable 'pegasus/decoder/layer_0/attention/self/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_0/attention/self/output/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_0/attention/output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_0/attention/self/query/kernel',
              <tf.Variable 'pegasus/decoder/layer_0/attention/self/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_0/attention/self/value/kernel',
              <tf.Variable 'pegasus/decoder/layer_0/attention/self/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_0/ffn/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_0/intermediate/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_0/ffn/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_0/intermediate/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_0/ffn/dense/bias',
              <tf.Variable 'pegasus/decoder/layer_0/intermediate/dense/bias:0' shape=(3072,) dtype=float32_ref>),
             ('decoder/layer_0/ffn/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_0/intermediate/dense/kernel:0' shape=(512, 3072) dtype=float32_ref>),
             ('decoder/layer_0/ffn/dense_1/bias',
              <tf.Variable 'pegasus/decoder/layer_0/output/dense/bias:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_0/ffn/dense_1/kernel',
              <tf.Variable 'pegasus/decoder/layer_0/output/dense/kernel:0' shape=(3072, 512) dtype=float32_ref>),
             ('decoder/layer_0/memory_attention/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_0/attention/encdec/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_0/memory_attention/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_0/attention/encdec/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_0/memory_attention/key/kernel',
              <tf.Variable 'pegasus/decoder/layer_0/attention/encdec/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_0/memory_attention/output/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_0/attention/encdec_output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_0/memory_attention/query/kernel',
              <tf.Variable 'pegasus/decoder/layer_0/attention/encdec/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_0/memory_attention/value/kernel',
              <tf.Variable 'pegasus/decoder/layer_0/attention/encdec/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_1/attention/self/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_1/attention/self/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_1/attention/self/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_1/attention/self/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_1/attention/self/key/kernel',
              <tf.Variable 'pegasus/decoder/layer_1/attention/self/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_1/attention/self/output/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_1/attention/output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_1/attention/self/query/kernel',
              <tf.Variable 'pegasus/decoder/layer_1/attention/self/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_1/attention/self/value/kernel',
              <tf.Variable 'pegasus/decoder/layer_1/attention/self/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_1/ffn/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_1/intermediate/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_1/ffn/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_1/intermediate/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_1/ffn/dense/bias',
              <tf.Variable 'pegasus/decoder/layer_1/intermediate/dense/bias:0' shape=(3072,) dtype=float32_ref>),
             ('decoder/layer_1/ffn/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_1/intermediate/dense/kernel:0' shape=(512, 3072) dtype=float32_ref>),
             ('decoder/layer_1/ffn/dense_1/bias',
              <tf.Variable 'pegasus/decoder/layer_1/output/dense/bias:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_1/ffn/dense_1/kernel',
              <tf.Variable 'pegasus/decoder/layer_1/output/dense/kernel:0' shape=(3072, 512) dtype=float32_ref>),
             ('decoder/layer_1/memory_attention/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_1/attention/encdec/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_1/memory_attention/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_1/attention/encdec/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_1/memory_attention/key/kernel',
              <tf.Variable 'pegasus/decoder/layer_1/attention/encdec/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_1/memory_attention/output/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_1/attention/encdec_output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_1/memory_attention/query/kernel',
              <tf.Variable 'pegasus/decoder/layer_1/attention/encdec/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_1/memory_attention/value/kernel',
              <tf.Variable 'pegasus/decoder/layer_1/attention/encdec/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_2/attention/self/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_2/attention/self/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_2/attention/self/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_2/attention/self/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_2/attention/self/key/kernel',
              <tf.Variable 'pegasus/decoder/layer_2/attention/self/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_2/attention/self/output/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_2/attention/output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_2/attention/self/query/kernel',
              <tf.Variable 'pegasus/decoder/layer_2/attention/self/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_2/attention/self/value/kernel',
              <tf.Variable 'pegasus/decoder/layer_2/attention/self/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_2/ffn/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_2/intermediate/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_2/ffn/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_2/intermediate/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_2/ffn/dense/bias',
              <tf.Variable 'pegasus/decoder/layer_2/intermediate/dense/bias:0' shape=(3072,) dtype=float32_ref>),
             ('decoder/layer_2/ffn/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_2/intermediate/dense/kernel:0' shape=(512, 3072) dtype=float32_ref>),
             ('decoder/layer_2/ffn/dense_1/bias',
              <tf.Variable 'pegasus/decoder/layer_2/output/dense/bias:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_2/ffn/dense_1/kernel',
              <tf.Variable 'pegasus/decoder/layer_2/output/dense/kernel:0' shape=(3072, 512) dtype=float32_ref>),
             ('decoder/layer_2/memory_attention/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_2/attention/encdec/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_2/memory_attention/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_2/attention/encdec/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_2/memory_attention/key/kernel',
              <tf.Variable 'pegasus/decoder/layer_2/attention/encdec/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_2/memory_attention/output/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_2/attention/encdec_output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_2/memory_attention/query/kernel',
              <tf.Variable 'pegasus/decoder/layer_2/attention/encdec/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_2/memory_attention/value/kernel',
              <tf.Variable 'pegasus/decoder/layer_2/attention/encdec/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_3/attention/self/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_3/attention/self/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_3/attention/self/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_3/attention/self/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_3/attention/self/key/kernel',
              <tf.Variable 'pegasus/decoder/layer_3/attention/self/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_3/attention/self/output/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_3/attention/output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_3/attention/self/query/kernel',
              <tf.Variable 'pegasus/decoder/layer_3/attention/self/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_3/attention/self/value/kernel',
              <tf.Variable 'pegasus/decoder/layer_3/attention/self/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_3/ffn/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_3/intermediate/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_3/ffn/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_3/intermediate/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_3/ffn/dense/bias',
              <tf.Variable 'pegasus/decoder/layer_3/intermediate/dense/bias:0' shape=(3072,) dtype=float32_ref>),
             ('decoder/layer_3/ffn/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_3/intermediate/dense/kernel:0' shape=(512, 3072) dtype=float32_ref>),
             ('decoder/layer_3/ffn/dense_1/bias',
              <tf.Variable 'pegasus/decoder/layer_3/output/dense/bias:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_3/ffn/dense_1/kernel',
              <tf.Variable 'pegasus/decoder/layer_3/output/dense/kernel:0' shape=(3072, 512) dtype=float32_ref>),
             ('decoder/layer_3/memory_attention/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_3/attention/encdec/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_3/memory_attention/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_3/attention/encdec/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_3/memory_attention/key/kernel',
              <tf.Variable 'pegasus/decoder/layer_3/attention/encdec/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_3/memory_attention/output/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_3/attention/encdec_output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_3/memory_attention/query/kernel',
              <tf.Variable 'pegasus/decoder/layer_3/attention/encdec/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_3/memory_attention/value/kernel',
              <tf.Variable 'pegasus/decoder/layer_3/attention/encdec/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_4/attention/self/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_4/attention/self/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_4/attention/self/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_4/attention/self/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_4/attention/self/key/kernel',
              <tf.Variable 'pegasus/decoder/layer_4/attention/self/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_4/attention/self/output/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_4/attention/output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_4/attention/self/query/kernel',
              <tf.Variable 'pegasus/decoder/layer_4/attention/self/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_4/attention/self/value/kernel',
              <tf.Variable 'pegasus/decoder/layer_4/attention/self/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_4/ffn/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_4/intermediate/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_4/ffn/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_4/intermediate/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_4/ffn/dense/bias',
              <tf.Variable 'pegasus/decoder/layer_4/intermediate/dense/bias:0' shape=(3072,) dtype=float32_ref>),
             ('decoder/layer_4/ffn/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_4/intermediate/dense/kernel:0' shape=(512, 3072) dtype=float32_ref>),
             ('decoder/layer_4/ffn/dense_1/bias',
              <tf.Variable 'pegasus/decoder/layer_4/output/dense/bias:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_4/ffn/dense_1/kernel',
              <tf.Variable 'pegasus/decoder/layer_4/output/dense/kernel:0' shape=(3072, 512) dtype=float32_ref>),
             ('decoder/layer_4/memory_attention/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_4/attention/encdec/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_4/memory_attention/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_4/attention/encdec/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_4/memory_attention/key/kernel',
              <tf.Variable 'pegasus/decoder/layer_4/attention/encdec/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_4/memory_attention/output/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_4/attention/encdec_output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_4/memory_attention/query/kernel',
              <tf.Variable 'pegasus/decoder/layer_4/attention/encdec/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_4/memory_attention/value/kernel',
              <tf.Variable 'pegasus/decoder/layer_4/attention/encdec/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_5/attention/self/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_5/attention/self/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_5/attention/self/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_5/attention/self/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_5/attention/self/key/kernel',
              <tf.Variable 'pegasus/decoder/layer_5/attention/self/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_5/attention/self/output/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_5/attention/output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_5/attention/self/query/kernel',
              <tf.Variable 'pegasus/decoder/layer_5/attention/self/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_5/attention/self/value/kernel',
              <tf.Variable 'pegasus/decoder/layer_5/attention/self/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_5/ffn/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_5/intermediate/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_5/ffn/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_5/intermediate/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_5/ffn/dense/bias',
              <tf.Variable 'pegasus/decoder/layer_5/intermediate/dense/bias:0' shape=(3072,) dtype=float32_ref>),
             ('decoder/layer_5/ffn/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_5/intermediate/dense/kernel:0' shape=(512, 3072) dtype=float32_ref>),
             ('decoder/layer_5/ffn/dense_1/bias',
              <tf.Variable 'pegasus/decoder/layer_5/output/dense/bias:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_5/ffn/dense_1/kernel',
              <tf.Variable 'pegasus/decoder/layer_5/output/dense/kernel:0' shape=(3072, 512) dtype=float32_ref>),
             ('decoder/layer_5/memory_attention/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_5/attention/encdec/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_5/memory_attention/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_5/attention/encdec/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_5/memory_attention/key/kernel',
              <tf.Variable 'pegasus/decoder/layer_5/attention/encdec/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_5/memory_attention/output/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_5/attention/encdec_output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_5/memory_attention/query/kernel',
              <tf.Variable 'pegasus/decoder/layer_5/attention/encdec/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_5/memory_attention/value/kernel',
              <tf.Variable 'pegasus/decoder/layer_5/attention/encdec/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('embeddings/weights',
              <tf.Variable 'pegasus/embeddings/word_embeddings:0' shape=(32128, 512) dtype=float32_ref>),
             ('encoder/LayerNorm/beta',
              <tf.Variable 'pegasus/encoder/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('encoder/LayerNorm/gamma',
              <tf.Variable 'pegasus/encoder/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_0/attention/self/LayerNorm/beta',
              <tf.Variable 'pegasus/encoder/layer_0/attention/self/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_0/attention/self/LayerNorm/gamma',
              <tf.Variable 'pegasus/encoder/layer_0/attention/self/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_0/attention/self/key/kernel',
              <tf.Variable 'pegasus/encoder/layer_0/attention/self/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_0/attention/self/output/dense/kernel',
              <tf.Variable 'pegasus/encoder/layer_0/attention/output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_0/attention/self/query/kernel',
              <tf.Variable 'pegasus/encoder/layer_0/attention/self/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_0/attention/self/value/kernel',
              <tf.Variable 'pegasus/encoder/layer_0/attention/self/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_0/ffn/LayerNorm/beta',
              <tf.Variable 'pegasus/encoder/layer_0/intermediate/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_0/ffn/LayerNorm/gamma',
              <tf.Variable 'pegasus/encoder/layer_0/intermediate/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_0/ffn/dense/bias',
              <tf.Variable 'pegasus/encoder/layer_0/intermediate/dense/bias:0' shape=(3072,) dtype=float32_ref>),
             ('encoder/layer_0/ffn/dense/kernel',
              <tf.Variable 'pegasus/encoder/layer_0/intermediate/dense/kernel:0' shape=(512, 3072) dtype=float32_ref>),
             ('encoder/layer_0/ffn/dense_1/bias',
              <tf.Variable 'pegasus/encoder/layer_0/output/dense/bias:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_0/ffn/dense_1/kernel',
              <tf.Variable 'pegasus/encoder/layer_0/output/dense/kernel:0' shape=(3072, 512) dtype=float32_ref>),
             ('encoder/layer_1/attention/self/LayerNorm/beta',
              <tf.Variable 'pegasus/encoder/layer_1/attention/self/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_1/attention/self/LayerNorm/gamma',
              <tf.Variable 'pegasus/encoder/layer_1/attention/self/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_1/attention/self/key/kernel',
              <tf.Variable 'pegasus/encoder/layer_1/attention/self/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_1/attention/self/output/dense/kernel',
              <tf.Variable 'pegasus/encoder/layer_1/attention/output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_1/attention/self/query/kernel',
              <tf.Variable 'pegasus/encoder/layer_1/attention/self/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_1/attention/self/value/kernel',
              <tf.Variable 'pegasus/encoder/layer_1/attention/self/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_1/ffn/LayerNorm/beta',
              <tf.Variable 'pegasus/encoder/layer_1/intermediate/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_1/ffn/LayerNorm/gamma',
              <tf.Variable 'pegasus/encoder/layer_1/intermediate/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_1/ffn/dense/bias',
              <tf.Variable 'pegasus/encoder/layer_1/intermediate/dense/bias:0' shape=(3072,) dtype=float32_ref>),
             ('encoder/layer_1/ffn/dense/kernel',
              <tf.Variable 'pegasus/encoder/layer_1/intermediate/dense/kernel:0' shape=(512, 3072) dtype=float32_ref>),
             ('encoder/layer_1/ffn/dense_1/bias',
              <tf.Variable 'pegasus/encoder/layer_1/output/dense/bias:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_1/ffn/dense_1/kernel',
              <tf.Variable 'pegasus/encoder/layer_1/output/dense/kernel:0' shape=(3072, 512) dtype=float32_ref>),
             ('encoder/layer_2/attention/self/LayerNorm/beta',
              <tf.Variable 'pegasus/encoder/layer_2/attention/self/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_2/attention/self/LayerNorm/gamma',
              <tf.Variable 'pegasus/encoder/layer_2/attention/self/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_2/attention/self/key/kernel',
              <tf.Variable 'pegasus/encoder/layer_2/attention/self/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_2/attention/self/output/dense/kernel',
              <tf.Variable 'pegasus/encoder/layer_2/attention/output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_2/attention/self/query/kernel',
              <tf.Variable 'pegasus/encoder/layer_2/attention/self/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_2/attention/self/value/kernel',
              <tf.Variable 'pegasus/encoder/layer_2/attention/self/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_2/ffn/LayerNorm/beta',
              <tf.Variable 'pegasus/encoder/layer_2/intermediate/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_2/ffn/LayerNorm/gamma',
              <tf.Variable 'pegasus/encoder/layer_2/intermediate/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_2/ffn/dense/bias',
              <tf.Variable 'pegasus/encoder/layer_2/intermediate/dense/bias:0' shape=(3072,) dtype=float32_ref>),
             ('encoder/layer_2/ffn/dense/kernel',
              <tf.Variable 'pegasus/encoder/layer_2/intermediate/dense/kernel:0' shape=(512, 3072) dtype=float32_ref>),
             ('encoder/layer_2/ffn/dense_1/bias',
              <tf.Variable 'pegasus/encoder/layer_2/output/dense/bias:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_2/ffn/dense_1/kernel',
              <tf.Variable 'pegasus/encoder/layer_2/output/dense/kernel:0' shape=(3072, 512) dtype=float32_ref>),
             ('encoder/layer_3/attention/self/LayerNorm/beta',
              <tf.Variable 'pegasus/encoder/layer_3/attention/self/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_3/attention/self/LayerNorm/gamma',
              <tf.Variable 'pegasus/encoder/layer_3/attention/self/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_3/attention/self/key/kernel',
              <tf.Variable 'pegasus/encoder/layer_3/attention/self/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_3/attention/self/output/dense/kernel',
              <tf.Variable 'pegasus/encoder/layer_3/attention/output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_3/attention/self/query/kernel',
              <tf.Variable 'pegasus/encoder/layer_3/attention/self/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_3/attention/self/value/kernel',
              <tf.Variable 'pegasus/encoder/layer_3/attention/self/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_3/ffn/LayerNorm/beta',
              <tf.Variable 'pegasus/encoder/layer_3/intermediate/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_3/ffn/LayerNorm/gamma',
              <tf.Variable 'pegasus/encoder/layer_3/intermediate/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_3/ffn/dense/bias',
              <tf.Variable 'pegasus/encoder/layer_3/intermediate/dense/bias:0' shape=(3072,) dtype=float32_ref>),
             ('encoder/layer_3/ffn/dense/kernel',
              <tf.Variable 'pegasus/encoder/layer_3/intermediate/dense/kernel:0' shape=(512, 3072) dtype=float32_ref>),
             ('encoder/layer_3/ffn/dense_1/bias',
              <tf.Variable 'pegasus/encoder/layer_3/output/dense/bias:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_3/ffn/dense_1/kernel',
              <tf.Variable 'pegasus/encoder/layer_3/output/dense/kernel:0' shape=(3072, 512) dtype=float32_ref>),
             ('encoder/layer_4/attention/self/LayerNorm/beta',
              <tf.Variable 'pegasus/encoder/layer_4/attention/self/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_4/attention/self/LayerNorm/gamma',
              <tf.Variable 'pegasus/encoder/layer_4/attention/self/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_4/attention/self/key/kernel',
              <tf.Variable 'pegasus/encoder/layer_4/attention/self/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_4/attention/self/output/dense/kernel',
              <tf.Variable 'pegasus/encoder/layer_4/attention/output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_4/attention/self/query/kernel',
              <tf.Variable 'pegasus/encoder/layer_4/attention/self/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_4/attention/self/value/kernel',
              <tf.Variable 'pegasus/encoder/layer_4/attention/self/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_4/ffn/LayerNorm/beta',
              <tf.Variable 'pegasus/encoder/layer_4/intermediate/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_4/ffn/LayerNorm/gamma',
              <tf.Variable 'pegasus/encoder/layer_4/intermediate/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_4/ffn/dense/bias',
              <tf.Variable 'pegasus/encoder/layer_4/intermediate/dense/bias:0' shape=(3072,) dtype=float32_ref>),
             ('encoder/layer_4/ffn/dense/kernel',
              <tf.Variable 'pegasus/encoder/layer_4/intermediate/dense/kernel:0' shape=(512, 3072) dtype=float32_ref>),
             ('encoder/layer_4/ffn/dense_1/bias',
              <tf.Variable 'pegasus/encoder/layer_4/output/dense/bias:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_4/ffn/dense_1/kernel',
              <tf.Variable 'pegasus/encoder/layer_4/output/dense/kernel:0' shape=(3072, 512) dtype=float32_ref>),
             ('encoder/layer_5/attention/self/LayerNorm/beta',
              <tf.Variable 'pegasus/encoder/layer_5/attention/self/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_5/attention/self/LayerNorm/gamma',
              <tf.Variable 'pegasus/encoder/layer_5/attention/self/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_5/attention/self/key/kernel',
              <tf.Variable 'pegasus/encoder/layer_5/attention/self/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_5/attention/self/output/dense/kernel',
              <tf.Variable 'pegasus/encoder/layer_5/attention/output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_5/attention/self/query/kernel',
              <tf.Variable 'pegasus/encoder/layer_5/attention/self/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_5/attention/self/value/kernel',
              <tf.Variable 'pegasus/encoder/layer_5/attention/self/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_5/ffn/LayerNorm/beta',
              <tf.Variable 'pegasus/encoder/layer_5/intermediate/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_5/ffn/LayerNorm/gamma',
              <tf.Variable 'pegasus/encoder/layer_5/intermediate/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_5/ffn/dense/bias',
              <tf.Variable 'pegasus/encoder/layer_5/intermediate/dense/bias:0' shape=(3072,) dtype=float32_ref>),
             ('encoder/layer_5/ffn/dense/kernel',
              <tf.Variable 'pegasus/encoder/layer_5/intermediate/dense/kernel:0' shape=(512, 3072) dtype=float32_ref>),
             ('encoder/layer_5/ffn/dense_1/bias',
              <tf.Variable 'pegasus/encoder/layer_5/output/dense/bias:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_5/ffn/dense_1/kernel',
              <tf.Variable 'pegasus/encoder/layer_5/output/dense/kernel:0' shape=(3072, 512) dtype=float32_ref>)])

My Pegasus config, copy-pasted from https://github.com/google-research/bigbird/blob/master/bigbird/summarization/pegasus_large.sh

bert_config = {
    # transformer basic configs
    'attention_probs_dropout_prob': 0.1,
    'hidden_act': 'relu',
    'hidden_dropout_prob': 0.1,
    'hidden_size': 512,
    'initializer_range': 0.02,
    'intermediate_size': 3072,
    'max_position_embeddings': 4096,
    'max_encoder_length': 2048,
    'max_decoder_length': 512,
    'num_attention_heads': 8,
    'num_hidden_layers': 6,
    'type_vocab_size': 2,
    'scope': 'pegasus',
    'use_bias': False,
    'rescale_embedding': True,
    'vocab_model_file': None,
    # sparse mask configs
    'attention_type': 'block_sparse',
    'norm_type': 'prenorm',
    'block_size': 64,
    'num_rand_blocks': 3,
    'vocab_size': 32128,
    'beam_size': 1,
    'alpha': 0.0,
    'couple_encoder_decoder': False,
    'num_warmup_steps': 10000,
    'learning_rate': 0.1,
    'label_smoothing': 0.1,
    'optimizer': 'Adafactor',
    'use_tpu': True,
}

I am not sure this mapping is correct, and fine-tuning is really slow, so any guidance on the variable mapping would be really helpful.
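For reference, the renaming visible in the pairs listed above can be written down as a handful of string rules. This is only a sketch inferred from the names shown in this dump; `RENAME_RULES` and `to_variable_name` are hypothetical helpers, not part of BigBird or Pegasus, and variables that do not appear in the excerpt may need rules that are not covered here.

```python
# Illustrative rewrite rules inferred only from the (checkpoint name, variable)
# pairs shown in the dump; they are assumptions, not a confirmed mapping.
RENAME_RULES = [
    ("ffn/dense_1", "output/dense"),
    ("ffn/dense", "intermediate/dense"),
    ("ffn/LayerNorm", "intermediate/LayerNorm"),
    ("memory_attention/output", "attention/encdec_output"),
    ("memory_attention", "attention/encdec"),
    ("attention/self/output", "attention/output"),
    ("embeddings/weights", "embeddings/word_embeddings"),
]


def to_variable_name(checkpoint_name, scope="pegasus"):
    """Map a checkpoint tensor name to a graph variable name (without ':0')."""
    for old, new in RENAME_RULES:
        if old in checkpoint_name:
            # Apply only the first matching rule so that specific rules
            # (e.g. 'ffn/dense_1') win over more general ones ('ffn/dense').
            checkpoint_name = checkpoint_name.replace(old, new)
            break
    return f"{scope}/{checkpoint_name}"


print(to_variable_name("decoder/layer_5/ffn/dense_1/kernel"))
```

Comparing the output of such a helper against the variable names of the built model is a quick way to spot names that silently fail to map.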

Preprocessing code for TriviaQA dataset

Dear authors,

Do you use the same preprocessing code as Longformer for the TriviaQA dataset, such as truncating each document to a length below 4096, the answer string matching algorithm, and using normalized aliases as training labels?

Variable error with the full_bigbird_mask method in the multi head attention class

The full_bigbird_mask function used by the multi-head attention class builds the BigBird mask from MAX_SEQ_LEN instead of the from_seq_length and to_seq_length arguments passed to it. This affects the attention_mask created by the convert_attn_list_to_mask(self, rand_attn) method, which calls it as follows:

temp_mask = [
    full_bigbird_mask(  # pylint: disable=g-complex-comprehension
        self.from_seq_length, self.to_seq_length,
        self.from_block_size, self.to_block_size,
        rand_attn=rand_attn[i])
    for i in range(self.num_attention_heads)
]

def full_bigbird_mask(from_seq_length,
                      to_seq_length,
                      from_block_size,
                      to_block_size,
                      rand_attn):
  """Calculate BigBird attention pattern as a full dense matrix.

  Args:
    from_seq_length: int. length of from sequence.
    to_seq_length: int. length of to sequence.
    from_block_size: int. size of block in from sequence.
    to_block_size: int. size of block in to sequence.
    rand_attn: adjacency matrix for random attention.

  Returns:
    attention mask matrix of shape [from_seq_length, to_seq_length]
  """

  attn_mask = np.zeros((MAX_SEQ_LEN, MAX_SEQ_LEN), dtype=np.int32)
  for i in range(1, (MAX_SEQ_LEN // from_block_size) - 1):

Because full_bigbird_mask uses MAX_SEQ_LEN rather than from_seq_length or to_seq_length, the function is not dynamic: MAX_SEQ_LEN is defined only once at the top of the module, and this seems to be causing a glitch in the convert_attn_list_to_mask method.
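A version that sizes the mask from its arguments might look like the sketch below. It is not the repository's code: the name `full_bigbird_mask_dynamic` is hypothetical, the attention pattern (a three-block sliding window, the random blocks listed in `rand_attn`, and global first and last blocks) is assumed from the paper's description, and `rand_attn` is assumed to have one row per non-global query block.

```python
import numpy as np


def full_bigbird_mask_dynamic(from_seq_length, to_seq_length,
                              from_block_size, to_block_size, rand_attn):
    """Dense BigBird-style mask sized from the arguments (illustrative sketch)."""
    num_from_blocks = from_seq_length // from_block_size
    attn_mask = np.zeros((from_seq_length, to_seq_length), dtype=np.int32)

    # Middle query blocks: sliding window of three key blocks plus random blocks.
    for i in range(1, num_from_blocks - 1):
        rows = slice(i * from_block_size, (i + 1) * from_block_size)
        attn_mask[rows, (i - 1) * to_block_size:(i + 2) * to_block_size] = 1
        for j in rand_attn[i - 1]:
            attn_mask[rows, j * to_block_size:(j + 1) * to_block_size] = 1

    # Global attention: first and last blocks attend everywhere and are
    # attended to by every position.
    attn_mask[:from_block_size, :] = 1
    attn_mask[-from_block_size:, :] = 1
    attn_mask[:, :to_block_size] = 1
    attn_mask[:, -to_block_size:] = 1
    return attn_mask.astype(bool)


# Toy usage: 8 blocks of size 4, every middle block also attends to block 6.
rand_attn = np.full((6, 2), 6)
mask = full_bigbird_mask_dynamic(32, 32, 4, 4, rand_attn)
print(mask.shape)
```

Allocating the array directly at [from_seq_length, to_seq_length] removes the dependency on a module-level constant, at the cost of assuming the sequence lengths are multiples of the block sizes.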

code error in version of tensorflow?

Hello Google Research,
Thanks for releasing the BigBird code.
I have run into a problem when running it.
The error message from the terminal is shown below.

My Python version is 3.9.7,
and the package versions are listed below.

If you know how to solve this problem, please let me know.

---------------------------------------VERSION OF PACKAGES----------------------------------------------------

Name Version Build Channel

_libgcc_mutex 0.1 main
_openmp_mutex 4.5 1_gnu
absl-py 0.12.0 pypi_0 pypi
aiohttp 3.8.1 pypi_0 pypi
aiosignal 1.2.0 pypi_0 pypi
ale-py 0.7.3 pypi_0 pypi
astunparse 1.6.3 pypi_0 pypi
async-timeout 4.0.1 pypi_0 pypi
attrs 21.2.0 pypi_0 pypi
bigbird 0.0.1 pypi_0 pypi
blessings 1.7 py39h06a4308_1002
bz2file 0.98 pypi_0 pypi
ca-certificates 2021.10.26 h06a4308_2
cachetools 4.2.4 pypi_0 pypi
certifi 2021.10.8 py39h06a4308_0
charset-normalizer 2.0.8 pypi_0 pypi
chex 0.1.0 pypi_0 pypi
click 8.0.3 pypi_0 pypi
cloudpickle 2.0.0 pypi_0 pypi
configparser 5.1.0 pypi_0 pypi
cycler 0.11.0 pypi_0 pypi
datasets 1.16.1 pypi_0 pypi
decorator 5.1.0 pypi_0 pypi
dill 0.3.4 pypi_0 pypi
dm-tree 0.1.6 pypi_0 pypi
docker-pycreds 0.4.0 pypi_0 pypi
dopamine-rl 3.2.1 pypi_0 pypi
filelock 3.4.0 pypi_0 pypi
flask 2.0.2 pypi_0 pypi
flatbuffers 2.0 pypi_0 pypi
flax 0.3.6 pypi_0 pypi
fonttools 4.28.2 pypi_0 pypi
frozenlist 1.2.0 pypi_0 pypi
fsspec 2021.11.1 pypi_0 pypi
future 0.18.2 pypi_0 pypi
gast 0.4.0 pypi_0 pypi
gevent 21.8.0 pypi_0 pypi
gin-config 0.5.0 pypi_0 pypi
gitdb 4.0.9 pypi_0 pypi
gitpython 3.1.24 pypi_0 pypi
google-api-core 2.2.2 pypi_0 pypi
google-api-python-client 2.31.0 pypi_0 pypi
google-auth 2.3.3 pypi_0 pypi
google-auth-httplib2 0.1.0 pypi_0 pypi
google-auth-oauthlib 0.4.6 pypi_0 pypi
google-pasta 0.2.0 pypi_0 pypi
googleapis-common-protos 1.53.0 pypi_0 pypi
gpustat 0.6.0 pyhd3eb1b0_1
greenlet 1.1.2 pypi_0 pypi
grpcio 1.42.0 pypi_0 pypi
gunicorn 20.1.0 pypi_0 pypi
gym 0.21.0 pypi_0 pypi
h5py 3.6.0 pypi_0 pypi
httplib2 0.20.2 pypi_0 pypi
huggingface-hub 0.2.0 pypi_0 pypi
idna 3.3 pypi_0 pypi
importlib-metadata 4.8.2 pypi_0 pypi
importlib-resources 5.4.0 pypi_0 pypi
itsdangerous 2.0.1 pypi_0 pypi
jax 0.2.25 pypi_0 pypi
jaxlib 0.1.74 pypi_0 pypi
jinja2 3.0.3 pypi_0 pypi
joblib 1.1.0 pypi_0 pypi
keras 2.7.0 pypi_0 pypi
keras-preprocessing 1.1.2 pypi_0 pypi
kfac 0.2.0 pypi_0 pypi
kiwisolver 1.3.2 pypi_0 pypi
ld_impl_linux-64 2.35.1 h7274673_9
libclang 12.0.0 pypi_0 pypi
libffi 3.3 he6710b0_2
libgcc-ng 9.3.0 h5101ec6_17
libgomp 9.3.0 h5101ec6_17
libstdcxx-ng 9.3.0 hd4cf53a_17
markdown 3.3.6 pypi_0 pypi
markupsafe 2.0.1 pypi_0 pypi
matplotlib 3.5.0 pypi_0 pypi
mesh-tensorflow 0.1.19 pypi_0 pypi
mpmath 1.2.1 pypi_0 pypi
msgpack 1.0.3 pypi_0 pypi
multidict 5.2.0 pypi_0 pypi
multiprocess 0.70.12.2 pypi_0 pypi
natsort 8.0.0 pypi_0 pypi
ncurses 6.3 h7f8727e_2
nltk 3.6.5 pypi_0 pypi
numpy 1.21.4 pypi_0 pypi
nvidia-ml 7.352.0 pyhd3eb1b0_0
oauth2client 4.1.3 pypi_0 pypi
oauthlib 3.1.1 pypi_0 pypi
opencv-python 4.5.4.60 pypi_0 pypi
openssl 1.1.1l h7f8727e_0
opt-einsum 3.3.0 pypi_0 pypi
optax 0.1.0 pypi_0 pypi
optimizers 2.1 pypi_0 pypi
packaging 21.3 pypi_0 pypi
pandas 1.3.4 pypi_0 pypi
pathtools 0.1.2 pypi_0 pypi
pillow 8.4.0 pypi_0 pypi
pip 21.2.4 py39h06a4308_0
promise 2.3 pypi_0 pypi
protobuf 3.19.1 pypi_0 pypi
psutil 5.8.0 py39h27cfd23_1
pyarrow 6.0.1 pypi_0 pypi
pyasn1 0.4.8 pypi_0 pypi
pyasn1-modules 0.2.8 pypi_0 pypi
pygame 2.1.0 pypi_0 pypi
pyparsing 3.0.6 pypi_0 pypi
pypng 0.0.21 pypi_0 pypi
python 3.9.7 h12debd9_1
python-dateutil 2.8.2 pypi_0 pypi
pytz 2021.3 pypi_0 pypi
pyyaml 6.0 pypi_0 pypi
readline 8.1 h27cfd23_0
regex 2021.11.10 pypi_0 pypi
requests 2.26.0 pypi_0 pypi
requests-oauthlib 1.3.0 pypi_0 pypi
rouge-score 0.0.4 pypi_0 pypi
rsa 4.8 pypi_0 pypi
sacremoses 0.0.46 pypi_0 pypi
scikit-learn 1.0.1 pypi_0 pypi
scipy 1.7.3 pypi_0 pypi
sentencepiece 0.1.96 pypi_0 pypi
sentry-sdk 1.5.0 pypi_0 pypi
setuptools 58.0.4 py39h06a4308_0
setuptools-scm 6.3.2 pypi_0 pypi
shortuuid 1.0.8 pypi_0 pypi
six 1.16.0 pyhd3eb1b0_0
sklearn 0.0 pypi_0 pypi
smmap 5.0.0 pypi_0 pypi
sqlite 3.36.0 hc218d9a_0
subprocess32 3.5.4 pypi_0 pypi
sympy 1.9 pypi_0 pypi
tensor2tensor 1.15.7 pypi_0 pypi
tensorboard 2.7.0 pypi_0 pypi
tensorboard-data-server 0.6.1 pypi_0 pypi
tensorboard-plugin-wit 1.8.0 pypi_0 pypi
tensorflow 2.7.0 pypi_0 pypi
tensorflow-addons 0.15.0 pypi_0 pypi
tensorflow-datasets 4.4.0 pypi_0 pypi
tensorflow-estimator 2.7.0 pypi_0 pypi
tensorflow-gan 2.1.0 pypi_0 pypi
tensorflow-hub 0.12.0 pypi_0 pypi
tensorflow-io-gcs-filesystem 0.22.0 pypi_0 pypi
tensorflow-metadata 1.5.0 pypi_0 pypi
tensorflow-probability 0.7.0 pypi_0 pypi
tensorflow-text 2.7.3 pypi_0 pypi
termcolor 1.1.0 pypi_0 pypi
tf-slim 1.1.0 pypi_0 pypi
tfds-nightly 4.4.0.dev202112010107 pypi_0 pypi
threadpoolctl 3.0.0 pypi_0 pypi
tk 8.6.11 h1ccaba5_0
tokenizer 3.3.2 pypi_0 pypi
tokenizers 0.10.3 pypi_0 pypi
tomli 1.2.2 pypi_0 pypi
toolz 0.11.2 pypi_0 pypi
torch 1.10.0 pypi_0 pypi
tqdm 4.62.3 pypi_0 pypi
transformers 4.12.5 pypi_0 pypi
typeguard 2.13.2 pypi_0 pypi
typing-extensions 3.10.0.2 pypi_0 pypi
tzdata 2021e hda174b7_0
uritemplate 4.1.1 pypi_0 pypi
urllib3 1.26.7 pypi_0 pypi
wandb 0.12.7 pypi_0 pypi
werkzeug 2.0.2 pypi_0 pypi
wheel 0.37.0 pyhd3eb1b0_1
wrapt 1.13.3 pypi_0 pypi
xxhash 2.0.2 pypi_0 pypi
xz 5.2.5 h7b6447c_0
yarl 1.7.2 pypi_0 pypi
yaspin 2.1.0 pypi_0 pypi
zipp 3.6.0 pypi_0 pypi
zlib 1.2.11 h7b6447c_3
zope-event 4.5.0 pypi_0 pypi
zope-interface 5.4.0 pypi_0 pypi

---------------------------------------ERROR MESSAGE----------------------------------------------------
(bigbird) kjk88@gpu2:~/bigbird$ sh -x bigbird/classifier/base_size.sh

  + python3 bigbird/classifier/run_classifier.py --data_dir=tfds://imdb_reviews/plain_text --output_dir=gs://bigbird-transformer-training/classifier/imdb
    WARNING:tensorflow:From /home/kjk88/anaconda3/envs/bigbird/lib/python3.9/site-packages/tensorflow/python/compat/v2_compat.py:111: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
    Instructions for updating:
    non-resource variables are not supported in the long term
    Traceback (most recent call last):
    File "/home/kjk88/bigbird/bigbird/classifier/run_classifier.py", line 460, in <module>
    app.run(main)
    File "/home/kjk88/anaconda3/envs/bigbird/lib/python3.9/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
    File "/home/kjk88/anaconda3/envs/bigbird/lib/python3.9/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
    File "/home/kjk88/bigbird/bigbird/classifier/run_classifier.py", line 375, in main
    bert_config = flags.as_dictionary()
    File "/home/kjk88/anaconda3/envs/bigbird/lib/python3.9/site-packages/bigbird/core/flags.py", line 187, in as_dictionary
    FLAGS.vocab_model_file = str(importlib_resources.files(bigbird).joinpath(
    File "/home/kjk88/anaconda3/envs/bigbird/lib/python3.9/importlib/resources.py", line 147, in files
    return _common.from_package(_get_package(package))
    File "/home/kjk88/anaconda3/envs/bigbird/lib/python3.9/importlib/_common.py", line 14, in from_package
    return fallback_resources(package.spec)
    File "/home/kjk88/anaconda3/envs/bigbird/lib/python3.9/importlib/_common.py", line 18, in fallback_resources
    package_directory = pathlib.Path(spec.origin).parent
    File "/home/kjk88/anaconda3/envs/bigbird/lib/python3.9/pathlib.py", line 1082, in __new__
    self = cls._from_parts(args, init=False)
    File "/home/kjk88/anaconda3/envs/bigbird/lib/python3.9/pathlib.py", line 707, in _from_parts
    drv, root, parts = self._parse_args(args)
    File "/home/kjk88/anaconda3/envs/bigbird/lib/python3.9/pathlib.py", line 691, in _parse_args
    a = os.fspath(a)
    TypeError: expected str, bytes or os.PathLike object, not NoneType
  + --attention_type=block_sparse --max_encoder_length=4096 --num_attention_heads=12 --num_hidden_layers=12 --hidden_size=768 --intermediate_size=3072 --block_size=64 --train_batch_size=2 --eval_batch_size=2 --do_train=True --do_eval=True --use_tpu=True --tpu_name=bigbird --tpu_zone=europe-west4-a --gcp_project=bigbird-project
    bigbird/classifier/base_size.sh: 8: bigbird/classifier/base_size.sh: --attention_type=block_sparse: not found
  + --num_tpu_cores=32 --init_checkpoint=gs://bigbird-transformer/pretrain/bigbr_base/model.ckpt-0
    bigbird/classifier/base_size.sh: 24: bigbird/classifier/base_size.sh: --num_tpu_cores=32: not found
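
One possible reading of the traceback: the last frames call pathlib.Path(spec.origin) and fail because the origin is None, which is what Python reports when a package is picked up as a namespace package (a directory without an __init__.py). Whether that is the cause here is an assumption. The check below is a small diagnostic sketch; `has_file_origin` is a hypothetical helper name, and 'json' is used only as a stand-in for a normally installed package.

```python
import importlib.util


def has_file_origin(package_name):
    """Return True if the package's import spec points at a concrete origin."""
    try:
        spec = importlib.util.find_spec(package_name)
    except (ImportError, ValueError):
        return False
    # spec is None when the package cannot be found at all;
    # spec.origin is None for namespace packages.
    return spec is not None and spec.origin is not None


# Replace 'json' with 'bigbird' to inspect the environment from the report.
print(has_file_origin("json"))
```

If the check returns False for bigbird, reinstalling the package so that it is imported as a regular package would be the first thing to try.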

Precision equals Recall in run_classifier.py script run.

I am trying to replicate the results of the paper. I ran the run_classifier.py script for 7000 training steps on IMDB reviews. After every 1000 batches, precision, recall, accuracy, F1 score, and loss are printed to the terminal. For all checkpoints, precision = recall = F1 = accuracy to every printed decimal place. I wonder whether there is a mistake in the calculation: for a binary dataset, precision, recall, and accuracy should not generally all be equal.

For example, for ckpt-1000, I got 0.9408210 for precision, recall, accuracy, and F1.
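
One explanation worth ruling out is micro-averaging. When every example has exactly one label and the counts are pooled over all classes, each wrong prediction is counted once as a false positive (for the predicted class) and once as a false negative (for the true class), so precision, recall, F1, and accuracy coincide. Whether run_classifier.py averages this way is an assumption; the sketch below only demonstrates the arithmetic on made-up labels.

```python
def micro_metrics(y_true, y_pred, num_classes):
    """Micro-averaged precision, recall, F1, and accuracy for single-label data."""
    tp = fp = fn = 0
    for c in range(num_classes):
        for t, p in zip(y_true, y_pred):
            tp += int(t == c and p == c)
            fp += int(t != c and p == c)
            fn += int(t == c and p != c)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return precision, recall, f1, accuracy


# Made-up binary labels: 5 of 8 predictions are correct.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 1]
print(micro_metrics(y_true, y_pred, num_classes=2))
```

Computing precision and recall for the positive class only (or macro-averaging) would normally give distinct values.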

Is it valid to train on GRCh38.p13 human reference instead of GRCh37 ?

Dear authors,

Thank you for this outstanding work!

I have a question regarding the reference genome used for training the genomic model.
In your paper you refer to GRCh37, but it seems to be outdated now, and Build 38 can be used instead (https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39).
Do you think it would be valid to train a BigBird model on the chromosomes of GRCh38.p13 for chromatin profile prediction, considering that the DeepSEA training dataset is based on GRCh37? Or should the same reference genome, GRCh37, be used for both datasets?

Learning rate mentioned in paper vs run_summarization.py

Hi,

The learning rate mentioned in the paper for summarization is around 3e-5, but in run_summarization.py the default value of the learning rate flag is 0.32.
The roberta_base.sh script does not override the learning rate.

Can anyone please clarify this? The learning rate is crucial for models like these.

Thanks

TFDS Custom Dataset Issue - normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.

I am using BigBird with a custom dataset (essay, label) for classification. I successfully imported the dataset as a custom tfds dataset, and the BigBird classifier runs but does not return any results, as shown in the log below. In the my_datset.py configuration file for tfds, I define the text feature with 'text': tfds.features.Text(). I believe that I need to add an encoder, but TensorFlow has deprecated this in tfds.features.Text and recommends using the new tensorflow_text, without explaining how to do this for tfds.features.Text. Can anyone recommend how to encode the text so that BigBird can perform the classification?

My GPUS are 0
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
{'label': <tf.Tensor 'ParseSingleExample/ParseExample/ParseExampleV2:0' shape=() dtype=int64>, 'text': <tf.Tensor 'ParseSingleExample/ParseExample/ParseExampleV2:1' shape=() dtype=string>}
Tensor("args_1:0", shape=(), dtype=string)
Tensor("args_0:0", shape=(), dtype=int64)

0%| | 0/199 [00:00<?, ?it/s]
42%|████▏ | 84/199 [00:00<00:00, 838.07it/s]
100%|██████████| 199/199 [00:00<00:00, 1124.10it/s]

0%| | 0/2000 [00:00<?, ?it/s]
0%| | 0/2000 [00:00<?, ?it/s]
{'label': <tf.Tensor 'ParseSingleExample/ParseExample/ParseExampleV2:0' shape=() dtype=int64>, 'text': <tf.Tensor 'ParseSingleExample/ParseExample/ParseExampleV2:1' shape=() dtype=string>}
Tensor("args_1:0", shape=(), dtype=string)
Tensor("args_0:0", shape=(), dtype=int64)

0it [00:00, ?it/s]
0it [00:00, ?it/s]
Loss = 0.0 Accuracy = 0.0

Unable to save and load the model after fine-tuning

In BigBird summarization, I loaded the pretrained model and fine-tuned it on the gigaword TensorFlow dataset. I then tried to save the model using tf.saved_model.save(model, data_dir=export_dir) and to load it using loaded_model = tf.keras.models.load_model("/drive/My Drive/Checkpoint_Summarization/original_saved"), but loading throws:
ValueError: Found zero restored functions for caller function.

Export predictions for each example

I have successfully run Google's BigBird NLP on the IMDB dataset and also a custom dataset imported using tfds. BigBird's imdb.ipynb only prints the overall accuracy and loss. I'm trying to export the predictions for each record in the dataset and have been unable to find any information on how to do this. Any help is appreciated!

Here is the current code that I used for the summary metrics:
eval_loss = tf.keras.metrics.Mean(name='eval_loss')
eval_accuracy = tf.keras.metrics.CategoricalAccuracy(name='eval_accuracy')

opt = tf.keras.optimizers.Adam(FLAGS.learning_rate)
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.CategoricalAccuracy(name='train_accuracy')

for i, ex in enumerate(tqdm(dataset.take(FLAGS.num_train_steps), position=0)):
  loss, log_probs, grads = fwd_bwd(ex[0], ex[1])
  opt.apply_gradients(zip(grads, model.trainable_weights+headl.trainable_weights))
  train_loss(loss)
  train_accuracy(tf.one_hot(ex[1], 2), log_probs)
  if i % 200 == 0:
    print('Loss = {} Accuracy = {}'.format(train_loss.result().numpy(), train_accuracy.result().numpy()))
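
One way to get per-example output is to keep the argmax of log_probs for every batch and write the rows to a file. The sketch below assumes log_probs is array-like with shape [batch, num_classes] and that the labels are integer class ids; `collect_predictions` and `write_predictions` are hypothetical helpers, not part of BigBird.

```python
import csv

import numpy as np


def collect_predictions(log_probs, labels, start_index=0):
    """Turn one batch of log-probabilities into per-example records."""
    log_probs = np.asarray(log_probs)
    predicted = log_probs.argmax(axis=-1)
    rows = []
    for offset, (label, pred) in enumerate(zip(labels, predicted)):
        rows.append({
            "example": start_index + offset,
            "label": int(label),
            "prediction": int(pred),
            "log_prob": float(log_probs[offset, pred]),
        })
    return rows


def write_predictions(rows, path):
    """Write the collected records to a CSV file."""
    fieldnames = ["example", "label", "prediction", "log_prob"]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


# Toy batch of two examples with two classes.
rows = collect_predictions([[-0.1, -2.3], [-1.6, -0.2]], labels=[0, 0])
print(rows)
```

In an evaluation loop like the one above, each batch could be appended with something like rows.extend(collect_predictions(log_probs.numpy(), ex[1].numpy(), start_index=len(rows))), assuming eager tensors.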

Model for genomic sequences

Hi,
I could not find a pretrained model for the genomic sequences task, nor could I find the scripts (training algorithm, tokenizer) that I could use to train my own model on the MLM task for genomic sequences.

Unconditional assert False in bigbird/core/utils.py

Hi,

I wanted to point out that in bigbird/core/utils.py at line 58, there is an unconditional assert False:

assert False, "Static shape not available for {}".format(tensor)

However, there is code after the assert statement. If I'm not mistaken, that means it is dead code:

  assert False, "Static shape not available for {}".format(tensor)

  dyn_shape = tf.shape(tensor)
  for index in non_static_indexes:
    shape[index] = dyn_shape[index]
  return shape
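
The effect can be reproduced without TensorFlow. The toy function below only mirrors the structure described above (an early return, an unconditional assert, then a fallback); `resolve_shape` is a hypothetical stand-in, not the code in utils.py.

```python
def resolve_shape(shape):
    """Toy stand-in with the same control flow as the snippet above."""
    if None not in shape:
        return shape
    assert False, "Static shape not available for {}".format(shape)
    # Never reached while assertions are enabled.
    return [0 if dim is None else dim for dim in shape]


print(resolve_shape([2, 4]))
try:
    resolve_shape([None, 4])
except AssertionError as err:
    print("raised:", err)
```

Note that Python strips assert statements when run with the -O flag, so the fallback would execute in that mode; under normal execution it is unreachable.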

Question about pre-trained weights

Thanks so much for releasing BigBird!

Quick question about the pre-trained weights. Do the bigbr_large and bigbr_base correspond to BERT-like encoder-only checkpoints and bigbp_large to the encoder-decoder version?

reproduce arxiv classification task

We are trying to reproduce the arXiv task with the F1 of 92 reported in the paper. We are using the default hyperparameters defined in bigbird/classifier/base_size.sh and the pretrained checkpoint here, but with a batch size of 2 per GPU due to memory limitations (total batch size = 8 GPUs * 2 = 16). After 16k steps (16000 * 16 / 30034 = 8.5 epochs), we only get an F1 of 84, which is much lower than the result in the paper, which was trained for 10 epochs.
Are we missing something, such as the preprocessing of arXiv? Or is it just because the batch size is too small?
Will you release the arXiv checkpoint in the future?

Regarding differences in the dataset: we fine-tuned RoBERTa on the same arXiv dataset and got an F1 of 86, which is pretty close to the paper.
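
The epoch arithmetic above, and the number of steps that would be needed to match 10 epochs at this batch size, can be checked with a few lines. The helper names are hypothetical; the numbers are the ones quoted in this report.

```python
import math


def epochs_seen(steps, global_batch_size, dataset_size):
    """Number of passes over the dataset after the given number of steps."""
    return steps * global_batch_size / dataset_size


def steps_for_epochs(epochs, global_batch_size, dataset_size):
    """Smallest number of steps that covers at least the requested epochs."""
    return math.ceil(epochs * dataset_size / global_batch_size)


global_batch = 8 * 2  # 8 GPUs with a per-GPU batch size of 2
print(round(epochs_seen(16000, global_batch, 30034), 2))  # 8.52
print(steps_for_epochs(10, global_batch, 30034))          # 18772
```

So the 16k-step run stops about 1.5 epochs short of the 10 epochs mentioned for the paper, in addition to using a smaller batch size.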

I want to know d.map("preprocess function",... ) processing

I tried to debug the "do_making" function in the "run_pretraining.py" file using the PyCharm IDE,

but execution does not stop at breakpoints inside this function.

My TensorFlow version is "tensorflow-2.4.0".
The PyCharm debugger is "pydev".

I have two questions:

Q.1 Why do breakpoints not work in this function?
Q.2 How can I solve this problem?


thank you

Error in run_classifier.py for attention_type=simulated_sparse

I am using the base_size.sh script to run run_classifier.py. I am able to train and evaluate on IMDB data with attention_type set to original_full or block_sparse, but when I set it to simulated_sparse I see errors while training is being initialized. The 12 layers are initialized, but training doesn't start. The main error log is below:

File "/home/amitghattimare/.local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3211, in _as_graph_def
    graph.ParseFromString(compat.as_bytes(data))
google.protobuf.message.DecodeError: Error parsing message

I used the script below to run the code, in case it helps the investigation. If I change attention_type to either of the other two options, it works fine. I am using only 8 cores because that's the maximum available in preemptible mode. I have reduced train_batch_size so that it fits in memory. I wonder if that's causing the issue, though the error logs don't indicate that.

python3 bigbird/classifier/run_classifier.py \
  --data_dir=tfds://imdb_reviews/plain_text \
  --output_dir=gs://bigbird-replication-bucket/classifier/imdb/sim_sparse_attention \
  --attention_type=simulated_sparse \
  --max_encoder_length=4096 \
  --num_attention_heads=12 \
  --num_hidden_layers=12 \
  --hidden_size=768 \
  --intermediate_size=3072 \
  --block_size=64 \
  --train_batch_size=1 \
  --eval_batch_size=2 \
  --do_train=True \
  --do_eval=False \
  --num_train_steps=1000 \
  --use_tpu=True \
  --tpu_name=bigbird \
  --tpu_zone=us-central1-b \
  --gcp_project=bigbird-replication \
  --num_tpu_cores=8 \
  --init_checkpoint=gs://bigbird-transformer/pretrain/bigbr_base/model.ckpt-0

How can we finetune the pretrained model using tfrecord files?

I've tried to fine-tune the model on my own text summarization dataset. Before doing that, I tested using TFRecord files as input, so I set data_dir to /tmp/bigb/tfds/aeslc/1.0.0:

flags.DEFINE_string(
    "data_dir", "/tmp/bigb/tfds/aeslc/1.0.0",
    "The input data dir. Should contain the TFRecord files. "
    "Can be TF Dataset with prefix tfds://")

Then I ran run_summarization.py, but I got the following error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Feature: document (data type: string) is required but could not be found.
         [[{{node ParseSingleExample/ParseExample/ParseExampleV2}}]]
         [[MultiDeviceIteratorGetNextFromShard]]
         [[RemoteCall]]
         [[IteratorGetNext]]
         [[Mean/_19475]]
  (1) Invalid argument: Feature: document (data type: string) is required but could not be found.
         [[{{node ParseSingleExample/ParseExample/ParseExampleV2}}]]
         [[MultiDeviceIteratorGetNextFromShard]]
         [[RemoteCall]]
         [[IteratorGetNext]]

Could anyone advise me on how to fine-tune the model using TFRecord files as input?

Why is BigBird Pegasus/Pegasus Repeating the Same Sentence for Summarization?

Hello,

BigBird Pegasus, when creating summaries of text, repeats the same sentence over and over. I have tried this with text on the Hugging Face model hub, and there is a related question on Stack Overflow (https://stackoverflow.com/questions/68911203/big-bird-pegasus-summarization-output-is-repeating-itself). Additionally, below are some images from the Hugging Face hub.

image

I am doing text summarization for my thesis and I am not sure why this is happening, but apparently it has been an issue for 6 months. Is there a way to prevent this from happening?

Thank you.
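One decoding-side mitigation that is often tried for this kind of repetition is n-gram blocking, which Hugging Face's `generate` exposes through the `no_repeat_ngram_size` argument (alongside `repetition_penalty`). Whether it resolves the behaviour for this particular checkpoint is not confirmed here. The plain-Python sketch below only illustrates the idea: `banned_next_tokens` is a hypothetical helper, not the library's implementation.

```python
def banned_next_tokens(generated, ngram_size):
    """Return the tokens that would repeat an n-gram already in `generated`.

    A decoder applying n-gram blocking gives these tokens zero probability
    at the next step, so no n-gram of length `ngram_size` appears twice.
    """
    if ngram_size < 1:
        raise ValueError("ngram_size must be at least 1")
    if len(generated) + 1 < ngram_size:
        return set()  # not enough context to complete any n-gram yet

    prefix_len = ngram_size - 1
    # The last (n - 1) tokens form the prefix of the n-gram being completed.
    current_prefix = tuple(generated[len(generated) - prefix_len:])

    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        if ngram[:-1] == current_prefix:
            banned.add(ngram[-1])
    return banned


tokens = "the model is good . the model is".split()
print(banned_next_tokens(tokens, 3))  # {'good'}
```

In the example, "the model is good" has already been generated once, so after the second "the model is" the token "good" is blocked and the decoder has to continue differently.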

Roberta Training

Hello,

First, congratulations on your work.

Second, from what I have found so far, you only support BERT-style training and not RoBERTa-style training.
Even if NSP is set to false, your script still requires the "next_sentence_labels" field, which is generated by the BERT script.

My question is:
How can we generate data for and train a model like RoBERTa, where there is only a single sequence per example and no NSP?

@manzilz @ppham27 your feedback is highly appreciated.
Thanks in advance for your reply.
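For the data-generation half of this question, RoBERTa-style inputs are usually built by packing contiguous sentences into one segment per example instead of sampling sentence pairs. The sketch below shows that packing step in plain Python under stated assumptions: `pack_single_segment` and its parameters are hypothetical, the sentences are assumed to be already tokenized, and it does not address the "next_sentence_labels" requirement of the training script described above.

```python
def pack_single_segment(sentences, max_tokens, cls="[CLS]", sep="[SEP]"):
    """Greedily pack tokenized sentences into single-segment examples.

    Each example is [CLS] + tokens + [SEP] with at most `max_tokens` tokens,
    and there is no second segment, so no NSP label is implied.
    """
    budget = max_tokens - 2  # reserve room for the [CLS] and [SEP] tokens
    if budget < 1:
        raise ValueError("max_tokens must be at least 3")

    examples, current = [], []
    for sentence in sentences:
        sentence = sentence[:budget]  # truncate a sentence that cannot fit at all
        if current and len(current) + len(sentence) > budget:
            examples.append([cls] + current + [sep])
            current = []
        current.extend(sentence)
    if current:
        examples.append([cls] + current + [sep])
    return examples


docs = [["a", "b"], ["c", "d", "e"], ["f"]]
for example in pack_single_segment(docs, max_tokens=6):
    print(example)
# ['[CLS]', 'a', 'b', '[SEP]']
# ['[CLS]', 'c', 'd', 'e', 'f', '[SEP]']
```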
