condenser's Issues

Resources and time required for pre-training

Thank you for your excellent work! Could you share how much compute and how much time you spent on pre-training Condenser and coCondenser, and what batch size and number of epochs were used?

Have you tried condenser pretraining on RoBERTa?

I pretrained a condenser-roberta-base on the same data and with the same hyperparameters, but the results on downstream tasks were lower than expected.

Have you ever tried condenser pretraining on RoBERTa-base?

Thank you

Whole word masking for RoBERTa

Can you elaborate on why the first token is appended as an integer instead of [i] in line 65?
If the first word is split by BPE, this seems to result in an uncaught exception for the following token.

cand_indexes.append(0)
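
For context, a minimal sketch of the usual whole-word-masking index construction for a BPE tokenizer (function and variable names other than cand_indexes are hypothetical, not the repository's code). Each entry of cand_indexes is expected to be a list of sub-token positions forming one word, so appending a bare integer for the first token would break a later cand_indexes[-1].append(i) call as soon as the first word is split into several BPE pieces:

def build_cand_indexes(tokens):
    # Group RoBERTa BPE sub-token positions into whole-word candidates.
    # Convention assumed here: a token starting with 'Ġ' begins a new word,
    # anything else continues the previous word.
    cand_indexes = []
    for i, token in enumerate(tokens):
        if i > 0 and not token.startswith("Ġ"):
            cand_indexes[-1].append(i)   # continuation piece of the previous word
        else:
            cand_indexes.append([i])     # start of a new word: a list, not a bare int
    return cand_indexes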

results on leaderboard

Hi,
The dev result of coCondenser on the MS MARCO Passage Ranking Submissions leaderboard is 0.443. Is this the result from the large-size model? Thank you @luyug


Unable to resume CoCondenser pretraining

The model checkpoints seem to be hard-coded as BertForMaskedLM and cannot be loaded back into the CoCondenser class.
Adding the following attributes in the initialization gets past the exceptions, but not all of the weights are loaded.

self._keys_to_ignore_on_save = None
self._keys_to_ignore_on_load_missing = None

Is there a way to resume training after interruptions?
Thanks!
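
For what it's worth, a rough diagnostic sketch (not the repository's official resume path; the checkpoint path is a placeholder and model is assumed to be the already-constructed co-training model): load the saved weights with strict=False and inspect which keys were actually restored.

import torch

state_dict = torch.load("output/checkpoint-20000/pytorch_model.bin", map_location="cpu")
result = model.load_state_dict(state_dict, strict=False)
print("missing keys:", result.missing_keys)        # e.g. head weights saved elsewhere
print("unexpected keys:", result.unexpected_keys)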

Cannot read json file for run_pre_training.py

Hi. Thank you for your great project, but I have run into a problem that I cannot fix.
I ran helper/create_train.py and produced a JSON file in the following format:

{"text": [432, 17738, 6713, 5237, 2925, 30225, 10838, 4848, 1713, 23, 455, 376, 386, 5883, 4, 2870, 4582, 1218, 9569, 4331, 432, 12, 303, 1500, 3959, 15, 9, 454, 4331, 490, 11, 1389, 34376, 1384, 4, 1389, 181, 63, 4026, 2608, 35, 432, 163, 7, 761, 19154, 59480, 28463]}
{"text": [432, 19126, 3269, 1766, 2792, 32059, 10838, 17738, 1383, 29, 71, 303, 1387, 56630, 4, 1494, 21505, 4, 32384, 1231, 718, 1362, 452, 181, 176, 189, 10, 4331, 41, 1391, 1766, 2792, 525, 1750, 2697, 35, 4439, 1607, 24, 386, 5883, 4, 2870, 4582, 4331, 6, 311, 5089, 34, 9, 40406, 2870, 4331, 151, 69, 452, 316, 5191, 124, 4331, 14157, 3959, 15, 21, 316, 5191, 102, 10, 40406, 2870, 36793, 37272, 4, 26, 10, 441, 2697, 1500, 39, 181, 1555, 682, 72, 3959, 15, 454, 490, 10, 4331, 14157, 53, 97, 328, 2135, 8, 2792, 525, 386, 5883, 4, 2870, 4582, 36793, 15831, 19126, 7838, 525, 386, 5883, 4, 2870, 4582, 4331, 11, 1391, 1766, 302, 9, 15012, 1384, 4, 91, 32384, 1231, 718, 1362, 452, 181, 176, 40, 4436, 2608, 302, 59035, 65, 39, 3160, 30, 3974, 44654, 4331, 302, 740, 3160, 15012, 1384, 65, 205, 226, 10, 39, 445, 30, 890, 31485, 1384, 4, 12, 37, 10, 3160, 226, 676, 6, 3160, 151, 3974, 226, 33099, 5, 6663, 713, 302, 1430, 226, 386, 5883, 4, 2870, 4582, 4331, 11, 45247, 40210, 35, 40406, 2870, 4331, 187, 4439, 509, 11, 1389, 13394, 10838, 16939, 251, 1494, 21505, 4, 1189, 10, 229, 1188, 33869, 16580, 1487, 4, 1300, 363, 29986, 1581, 4, 34128, 718, 1362, 452, 181, 176, 10, 4623, 4436, 2608, 49, 5665, 9324, 143, 302, 1430, 13, 226, 386, 5883, 4, 2870, 4582, 36793, 5]}
...

When I use this file with run_pre_training.py, I get the following error:

───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /opt/conda/lib/python3.10/site-packages/datasets/packaged_modules/json/json. │
│ py:152 in _generate_tables                                                   │
│                                                                              │
│   149 │   │   │   │   │   │   except pa.ArrowInvalid as e:                   │
│   150 │   │   │   │   │   │   │   try:                                       │
│   151 │   │   │   │   │   │   │   │   with open(file, encoding="utf-8") as f │
│ ❱ 152 │   │   │   │   │   │   │   │   │   dataset = json.load(f)             │
│   153 │   │   │   │   │   │   │   except json.JSONDecodeError:               │
│   154 │   │   │   │   │   │   │   │   logger.error(f"Failed to read file '{f │
│   155 │   │   │   │   │   │   │   │   raise e                                │
│                                                                              │
│ /opt/conda/lib/python3.10/json/__init__.py:293 in load                       │
│                                                                              │
│   290 │   To use a custom ``JSONDecoder`` subclass, specify it with the ``cl │
│   291 │   kwarg; otherwise ``JSONDecoder`` is used.                          │
│   292 │   """                                                                │
│ ❱ 293 │   return loads(fp.read(),                                            │
│   294 │   │   cls=cls, object_hook=object_hook,                              │
│   295 │   │   parse_float=parse_float, parse_int=parse_int,                  │
│   296 │   │   parse_constant=parse_constant, object_pairs_hook=object_pairs_ │
│                                                                              │
│ /opt/conda/lib/python3.10/json/__init__.py:346 in loads                      │
│                                                                              │
│   343 │   if (cls is None and object_hook is None and                        │
│   344 │   │   │   parse_int is None and parse_float is None and              │
│   345 │   │   │   parse_constant is None and object_pairs_hook is None and n │
│ ❱ 346 │   │   return _default_decoder.decode(s)                              │
│   347 │   if cls is None:                                                    │
│   348 │   │   cls = JSONDecoder                                              │
│   349 │   if object_hook is not None:                                        │
│                                                                              │
│ /opt/conda/lib/python3.10/json/decoder.py:337 in decode                      │
│                                                                              │
│   334 │   │   containing a JSON document).                                   │
│   335 │   │                                                                  │
│   336 │   │   """                                                            │
│ ❱ 337 │   │   obj, end = self.raw_decode(s, idx=_w(s, 0).end())              │
│   338 │   │   end = _w(s, end).end()                                         │
│   339 │   │   if end != len(s):                                              │
│   340 │   │   │   raise JSONDecodeError("Extra data", s, end)                │
│                                                                              │
│ /opt/conda/lib/python3.10/json/decoder.py:355 in raw_decode                  │
│                                                                              │
│   352 │   │   try:                                                           │
│   353 │   │   │   obj, end = self.scan_once(s, idx)                          │
│   354 │   │   except StopIteration as err:                                   │
│ ❱ 355 │   │   │   raise JSONDecodeError("Expecting value", s, err.value) fro │
│   356 │   │   return obj, end                                                │
│   357                                                                        │
╰──────────────────────────────────────────────────────────────────────────────╯
JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /opt/conda/lib/python3.10/site-packages/datasets/builder.py:1860 in          │
│ _prepare_split_single                                                        │
│                                                                              │
│   1857 │   │   │   )                                                         │
│   1858 │   │   │   try:                                                      │
│   1859 │   │   │   │   _time = time.time()                                   │
│ ❱ 1860 │   │   │   │   for _, table in generator:                            │
│   1861 │   │   │   │   │   if max_shard_size is not None and writer._num_byt │
│   1862 │   │   │   │   │   │   num_examples, num_bytes = writer.finalize()   │
│   1863 │   │   │   │   │   │   writer.close()                                │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/datasets/packaged_modules/json/json. │
│ py:155 in _generate_tables                                                   │
│                                                                              │
│   152 │   │   │   │   │   │   │   │   │   dataset = json.load(f)             │
│   153 │   │   │   │   │   │   │   except json.JSONDecodeError:               │
│   154 │   │   │   │   │   │   │   │   logger.error(f"Failed to read file '{f │
│ ❱ 155 │   │   │   │   │   │   │   │   raise e                                │
│   156 │   │   │   │   │   │   │   # If possible, parse the file as a list of │
│   157 │   │   │   │   │   │   │   if isinstance(dataset, list):  # list is t │
│   158 │   │   │   │   │   │   │   │   try:                                   │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/datasets/packaged_modules/json/json. │
│ py:131 in _generate_tables                                                   │
│                                                                              │
│   128 │   │   │   │   │   │   try:                                           │
│   129 │   │   │   │   │   │   │   while True:                                │
│   130 │   │   │   │   │   │   │   │   try:                                   │
│ ❱ 131 │   │   │   │   │   │   │   │   │   pa_table = paj.read_json(          │
│   132 │   │   │   │   │   │   │   │   │   │   io.BytesIO(batch), read_option │
│   133 │   │   │   │   │   │   │   │   │   )                                  │
│   134 │   │   │   │   │   │   │   │   │   break                              │
│                                                                              │
│ /kaggle/working/zalo_ltr_2021/pyarrow/_json.pyx:259 in                       │
│ pyarrow._json.read_json                                                      │
│                                                                              │
│ [Errno 2] No such file or directory:                                         │
│ '/kaggle/working/zalo_ltr_2021/pyarrow/_json.pyx'                            │
│                                                                              │
│ /kaggle/working/zalo_ltr_2021/pyarrow/error.pxi:144 in                       │
│ pyarrow.lib.pyarrow_internal_check_status                                    │
│                                                                              │
│ [Errno 2] No such file or directory:                                         │
│ '/kaggle/working/zalo_ltr_2021/pyarrow/error.pxi'                            │
│                                                                              │
│ /kaggle/working/zalo_ltr_2021/pyarrow/error.pxi:100 in                       │
│ pyarrow.lib.check_status                                                     │
│                                                                              │
│ [Errno 2] No such file or directory:                                         │
│ '/kaggle/working/zalo_ltr_2021/pyarrow/error.pxi'                            │
╰──────────────────────────────────────────────────────────────────────────────╯
ArrowInvalid: JSON parse error: Invalid value. in row 0

The above exception was the direct cause of the following exception:

╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /kaggle/working/zalo_ltr_2021/Condenser/run_pre_training.py:202 in <module>  │
│                                                                              │
│   199                                                                        │
│   200                                                                        │
│   201 if __name__ == "__main__":                                             │
│ ❱ 202 │   main()                                                             │
│   203                                                                        │
│                                                                              │
│ /kaggle/working/zalo_ltr_2021/Condenser/run_pre_training.py:95 in main       │
│                                                                              │
│    92 │   # Set seed before initializing model.                              │
│    93 │   set_seed(training_args.seed)                                       │
│    94 │                                                                      │
│ ❱  95 │   train_set = load_dataset(                                          │
│    96 │   │   'json',                                                        │
│    97 │   │   data_files=data_args.train_path,                               │
│    98 │   │   block_size=2**25,                                              │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/datasets/load.py:1782 in             │
│ load_dataset                                                                 │
│                                                                              │
│   1779 │   try_from_hf_gcs = path not in _PACKAGED_DATASETS_MODULES          │
│   1780 │                                                                     │
│   1781 │   # Download and prepare data                                       │
│ ❱ 1782 │   builder_instance.download_and_prepare(                            │
│   1783 │   │   download_config=download_config,                              │
│   1784 │   │   download_mode=download_mode,                                  │
│   1785 │   │   verification_mode=verification_mode,                          │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/datasets/builder.py:872 in           │
│ download_and_prepare                                                         │
│                                                                              │
│    869 │   │   │   │   │   │   │   prepare_split_kwargs["max_shard_size"] =  │
│    870 │   │   │   │   │   │   if num_proc is not None:                      │
│    871 │   │   │   │   │   │   │   prepare_split_kwargs["num_proc"] = num_pr │
│ ❱  872 │   │   │   │   │   │   self._download_and_prepare(                   │
│    873 │   │   │   │   │   │   │   dl_manager=dl_manager,                    │
│    874 │   │   │   │   │   │   │   verification_mode=verification_mode,      │
│    875 │   │   │   │   │   │   │   **prepare_split_kwargs,                   │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/datasets/builder.py:967 in           │
│ _download_and_prepare                                                        │
│                                                                              │
│    964 │   │   │                                                             │
│    965 │   │   │   try:                                                      │
│    966 │   │   │   │   # Prepare split will record examples associated to th │
│ ❱  967 │   │   │   │   self._prepare_split(split_generator, **prepare_split_ │
│    968 │   │   │   except OSError as e:                                      │
│    969 │   │   │   │   raise OSError(                                        │
│    970 │   │   │   │   │   "Cannot find data file. "                         │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/datasets/builder.py:1749 in          │
│ _prepare_split                                                               │
│                                                                              │
│   1746 │   │   │   gen_kwargs = split_generator.gen_kwargs                   │
│   1747 │   │   │   job_id = 0                                                │
│   1748 │   │   │   with pbar:                                                │
│ ❱ 1749 │   │   │   │   for job_id, done, content in self._prepare_split_sing │
│   1750 │   │   │   │   │   gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_ │
│   1751 │   │   │   │   ):                                                    │
│   1752 │   │   │   │   │   if done:                                          │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/datasets/builder.py:1892 in          │
│ _prepare_split_single                                                        │
│                                                                              │
│   1889 │   │   │   # Ignore the writer's error for no examples written to th │
│   1890 │   │   │   if isinstance(e, SchemaInferenceError) and e.__context__  │
│   1891 │   │   │   │   e = e.__context__                                     │
│ ❱ 1892 │   │   │   raise DatasetGenerationError("An error occurred while gen │
│   1893 │   │                                                                 │
│   1894 │   │   yield job_id, True, (total_num_examples, total_num_bytes, wri │
│   1895                                                                       │
╰──────────────────────────────────────────────────────────────────────────────╯
DatasetGenerationError: An error occurred while generating the dataset

I suspect the problem may be an incompatible combination of datasets and transformers versions, but I have tried many versions of datasets and still hit the error. Can you help me fix this? Thank you so much!
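
As a first sanity check (a standard-library sketch, assuming the file is meant to be JSON Lines; the path is a placeholder), it may help to confirm that every line of the training file parses on its own before handing it to load_dataset('json', ...):

import json

path = "data/book_wiki.json"   # placeholder: point this at your own training file
with open(path, encoding="utf-8") as f:
    for lineno, line in enumerate(f, 1):
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as e:
            raise SystemExit(f"line {lineno} is not valid JSON: {e}")
        if "text" not in record:
            raise SystemExit(f"line {lineno} is missing the 'text' field")
print("file looks like valid JSON Lines")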

Error when continuing train from pretrained model

Hi,

Thank you for your great work.

I am trying to run your code on my machine:

torch 1.8.1, transformers 4.9.2

but I get the error below.

Traceback (most recent call last):
  File "run_pre_training.py", line 202, in <module>
    main()
  File "run_pre_training.py", line 172, in main
    trainer.train(model_path=model_path)
  File "/home/dzge/.conda/envs/workspace/lib/python3.8/site-packages/transformers/trainer.py", line 1072, in train
    self._load_state_dict_in_model(state_dict)
  File "/home/dzge/.conda/envs/workspace/lib/python3.8/site-packages/transformers/trainer.py", line 1412, in _load_state_dict_in_model
    if set(load_result.missing_keys) == set(self.model._keys_to_ignore_on_save):
  File "/home/dzge/.conda/envs/workspace/lib/python3.8/site-packages/torch/nn/modules/module.py", line 947, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'CondenserForPretraining' object has no attribute '_keys_to_ignore_on_save'

I cannot continue training with either bert-base-uncased or the Luyu/condenser model. I have already downloaded the model and put it in a local folder.

Thank you.

The relative weight of the MLM loss compared to the contrastive loss

In the paper, Equation 7 indicates that both the MLM and contrastive losses are divided by the effective batch size, whose value would be equal to 2 * per_device_train_batch_size * world_size. But the MLM loss calculation code seems to divide the MLM loss by per_device_train_batch_size * world_size (line 227), since the CoCondenserDataset's __getitem__ method returns two spans belonging to the same document, thereby making the actual batch dimension larger by a factor of 2.

I feel like I am missing something. Could you please help me out?

Condenser/modeling.py

Lines 219 to 230 in de9c257

loss = self.mlm_loss(hiddens, labels)
if self.model_args.late_mlm:
    loss += lm_out.loss
if grad_cache is None:
    co_loss = self.compute_contrastive_loss(co_cls_hiddens)
    return loss + co_loss
else:
    loss = loss * (float(hiddens.size(0)) / self.train_args.per_device_train_batch_size)
    cached_grads = grad_cache[chunk_offset: chunk_offset + co_cls_hiddens.size(0)]
    surrogate = torch.dot(cached_grads.flatten(), co_cls_hiddens.flatten())
    return loss, surrogate

Condenser/data.py

Lines 177 to 179 in de9c257

def __getitem__(self, item):
    spans = self.dataset[item]['spans']
    return random.sample(spans, 2)
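
To make the arithmetic behind the question concrete, here is a small worked example with hypothetical numbers (it restates the question's reasoning, not an answer from the authors):

per_device_train_batch_size = 64
spans_per_example = 2                  # CoCondenserDataset returns two spans per item
actual_batch_dim = per_device_train_batch_size * spans_per_example   # 128

# Assuming mlm_loss returns a per-chunk mean, the scaling in the modeling.py
# excerpt above (hiddens.size(0) / per_device_train_batch_size) sums across the
# chunks of one device batch to actual_batch_dim / per_device_train_batch_size,
# i.e. the factor of 2 the question describes.
print(actual_batch_dim / per_device_train_batch_size)   # 2.0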

api feature

Hi Luyu,

Thank you for publishing this great work and the tools. Do you have, or plan to build, an API for Condenser, so that one can call it case-wise with (query, passages) and get back the top passage, a ranked list, or ranking scores?

ICT Pretrained Model

Hi Luyu,

Could you point us to where we can find the ICT pretrained model to replicate these results?

Thanks

failed to reproduce the condenser pretraining results on V100

I am trying to reproduce the Condenser pretraining results. I evaluate the checkpoints on the STS-B task with sentence-transformers, but the results differ.
(1) bert-base-uncased
2022-01-03 17:07:01 - Load pretrained SentenceTransformer: output/training_stsbenchmark_bert-base-uncased-2022-01-03_17-04-06
2022-01-03 17:07:02 - Use pytorch device: cuda
2022-01-03 17:07:02 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-test dataset:
2022-01-03 17:07:05 - Cosine-Similarity : Pearson: 0.8484 Spearman: 0.8419
2022-01-03 17:07:05 - Manhattan-Distance: Pearson: 0.8345 Spearman: 0.8322
2022-01-03 17:07:05 - Euclidean-Distance: Pearson: 0.8349 Spearman: 0.8328
2022-01-03 17:07:05 - Dot-Product-Similarity: Pearson: 0.7521 Spearman: 0.7421

(2) Luyu/condenser
2022-01-03 17:12:46 - Load pretrained SentenceTransformer: output/training_stsbenchmark_Luyu-condenser-2022-01-03_17-09-51
2022-01-03 17:12:48 - Use pytorch device: cuda
2022-01-03 17:12:48 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-test dataset:
2022-01-03 17:12:50 - Cosine-Similarity : Pearson: 0.8528 Spearman: 0.8504
2022-01-03 17:12:50 - Manhattan-Distance: Pearson: 0.8394 Spearman: 0.8380
2022-01-03 17:12:50 - Euclidean-Distance: Pearson: 0.8396 Spearman: 0.8378
2022-01-03 17:12:50 - Dot-Product-Similarity: Pearson: 0.7942 Spearman: 0.7819

(3) self-trained checkpoints
2022-01-03 17:34:30 - Load pretrained SentenceTransformer: output/training_stsbenchmark_output--2022-01-03_17-31-48
2022-01-03 17:34:32 - Use pytorch device: cuda
2022-01-03 17:34:32 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-test dataset:
2022-01-03 17:34:34 - Cosine-Similarity : Pearson: 0.8498 Spearman: 0.8469
2022-01-03 17:34:34 - Manhattan-Distance: Pearson: 0.8415 Spearman: 0.8396
2022-01-03 17:34:34 - Euclidean-Distance: Pearson: 0.8423 Spearman: 0.8402
2022-01-03 17:34:34 - Dot-Product-Similarity: Pearson: 0.7959 Spearman: 0.7826

I ran the pretraining on 8x 32 GB V100s with the following settings:

python -m torch.distributed.launch --nproc_per_node 8 run_pre_training.py \
  --output_dir output \
  --model_name_or_path bert-base-uncased \
  --do_train \
  --save_steps 20000 \
  --per_device_train_batch_size 128 \
  --gradient_accumulation_steps 1 \
  --fp16 \
  --warmup_ratio 0.1 \
  --learning_rate 1e-4 \
  --num_train_epochs 8 \
  --overwrite_output_dir \
  --dataloader_num_workers 16 \
  --n_head_layers 2 \
  --skip_from 6 \
  --max_seq_length 128 \
  --train_dir data \
  --weight_decay 0.01 \
  --late_mlm

I use per_device_train_batch_size = 128, so the global batch size is 128 x 8 = 1024.
The pre-training data is BookCorpus + Wikipedia, created with the code released by NVIDIA.

Raw data:
5.0G bookscorpus_one_book_per_line.txt
13G wikicorpus_en_one_article_per_line.txt

After preprocessing:
24G book_wiki.json
containing 41,420,334 lines with maxlen=128

I used the same data to train bert-large and was able to reach F1 = 90% on the SQuAD task, so I think the corpus should be fine.

Could you please give me some suggestions? Thank you.
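
For reference, this is roughly how a checkpoint can be wrapped for the sentence-transformers STS-B script (a sketch only; the checkpoint path and the mean-pooling choice are assumptions, not necessarily the evaluation setup used above):

from sentence_transformers import SentenceTransformer, models

checkpoint = "output/checkpoint-final"   # placeholder path to a pretrained checkpoint
word_embedding_model = models.Transformer(checkpoint, max_seq_length=128)
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode="mean",
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])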

transfer to msmarco document dataset

Hi~
I am using this repo to run experiments on the MS MARCO document dataset, but I am a little confused about the differences between the Condenser, Tevatron, and coCondenser repos. I followed the "coCondenser MS-MARCO Passage Retrieval" guide and am trying to transfer the data to the MS MARCO document dataset and the checkpoint to Condenser. If I just want to reproduce the result of the coCondenser paper, do I only need to encode and then run index search? And if I want to transfer to the MS MARCO document data with the Condenser checkpoint, do I need to follow the stage-one and stage-two fine-tuning steps: first fine-tune a checkpoint and save it to retriever_model_s1/, then use that checkpoint to mine hard negatives, then use the hard negatives to further fine-tune the model and save it to retriever_model_s2/, and finally run retrieval on the dev set? Is that right?

cocondenser-marco pretraining data

Hi,
I am trying to reproduce coCondenser on the MS MARCO data, but I only got 37.4 on the MS MARCO dev task. Could you help me?
The MS MARCO data was downloaded from https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docs.tsv.gz (22 GB), and I extracted spans with the following code:

def encode_one(line):
    spans = nltk.sent_tokenize(line.strip())
    if len(spans) < 2:
        return None
    tokenized = [
        tokenizer(
            s,
            add_special_tokens=False,
            truncation=False,
            return_attention_mask=False,
            return_token_type_ids=False,
        )["input_ids"] for s in spans
    ]
    tokenized = [span for span in tokenized if len(span) > 0]
    return json.dumps({'spans': tokenized})
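
(For completeness, a hypothetical driver for the snippet above; the tokenizer choice, the TSV column layout, and the file paths are assumptions, not details from the issue.)

import json
import nltk
from transformers import AutoTokenizer

nltk.download("punkt", quiet=True)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

with open("msmarco-docs.tsv", encoding="utf-8") as fin, \
        open("pretrain_data/msmarco/msmarco.json", "w", encoding="utf-8") as fout:
    for line in fin:
        body = line.rstrip("\n").split("\t")[-1]   # assume the body text is the last column
        encoded = encode_one(body)
        if encoded is not None:
            fout.write(encoded + "\n")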

The hyperparameters are as follows (on 16x 3090 GPUs):

python -m torch.distributed.launch --nnodes=2 --node_rank=$1 --master_addr 10.104.91.11 --master_port 2222 --nproc_per_node 8 run_co_pre_training.py \
  --output_dir coco_msmarco_output_1e-4_bs2048 \
  --model_name_or_path Luyu/condenser \
  --do_train \
  --fp16 \
  --save_steps 2000 \
  --save_total_limit 10 \
  --model_type bert \
  --per_device_train_batch_size 128 \
  --gradient_accumulation_steps 1 \
  --warmup_ratio 0.1 \
  --learning_rate 1e-4 \
  --num_train_epochs 8 \
  --dataloader_drop_last \
  --overwrite_output_dir \
  --dataloader_num_workers 32 \
  --n_head_layers 2 \
  --skip_from 6 \
  --max_seq_length 128 \
  --train_path pretrain_data/msmarco/msmarco.json \
  --weight_decay 0.01 \
  --late_mlm \
  --cache_chunk_size 32

At the end of training, logs:
{'loss': 13.2283, 'learning_rate': 1.5294203533362537e-05, 'epoch': 6.9}
{'loss': 13.2219, 'learning_rate': 1.0731493648707841e-05, 'epoch': 7.23}
{'loss': 13.2096, 'learning_rate': 6.168783764053146e-06, 'epoch': 7.56}
{'loss': 13.1718, 'learning_rate': 1.6060738793984525e-06, 'epoch': 7.88}


On msmarco dev:
MRR @10: 0.3741504639104922
QueriesRanked: 6980
recall@1: 0.251432664756447
recall@50: 0.6607449856733524
recall@all: 0.6607449856733524
#####################

Detected call of `lr_scheduler.step()` before `optimizer.step()`

Hi, I ran run_pre_training.py and got this warning: "UserWarning: Detected call of lr_scheduler.step() before optimizer.step()". I looked for the optimizer and scheduler in your repo but didn't find anything. Can you help me fix this? I'm using transformers 4.3.1.

Regarding the spans in the contrastive loss calculation

Hello,

In the paper it is stated that

... given a random list of n documents [d1, d2, ..., dn], we extract randomly from each a pair of spans, [s11, s12, ..., sn1, sn2].

I was wondering how the spans are extracted from a document. Are they sentences, each split off by nltk's sentence tokenizer? Or are they equally sized chunks extracted with a sliding window? Or perhaps they are the same as the Condenser pretraining blocks, annotated with the id of the document they belong to?

Thank you.

How to resume_from_checkpoint

Hi, I'm a newbie with Condenser. I use Colab to run the code, and a session can only run for 24 hours at most. I didn't find any resume_from_checkpoint option in your code. How can I continue my training after 24 hours?
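
As a hedged pointer (the exact keyword depends on the transformers version the repo pins): recent Hugging Face Trainer versions can restore model, optimizer, and scheduler state from a saved checkpoint directory, for example:

# Inside run_pre_training.py's main(), after the Trainer is constructed;
# the checkpoint path is a placeholder for a directory written via --save_steps.
trainer.train(resume_from_checkpoint="output/checkpoint-20000")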

reproducing your results on MS MARCO

Hi,

Thank you for your great work!
I would like to replicate your results on the MS MARCO passage collection, and I have a question regarding the Luyu/co-condenser-marco model. Is this the final model that you used to retrieve documents, or do I need to train it on MS MARCO relevant query/passage pairs?
Could you provide a little more detail on how I should use your dense toolkit with this model?

Thank you in advance!
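
For reference, the released checkpoint loads as a standard Hugging Face encoder (a minimal sketch; whether additional fine-tuning on MS MARCO query/passage pairs is needed is exactly the question above):

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Luyu/co-condenser-marco")
model = AutoModel.from_pretrained("Luyu/co-condenser-marco")

inputs = tokenizer("what is dense retrieval?", return_tensors="pt")
cls_embedding = model(**inputs).last_hidden_state[:, 0]   # [CLS] vector, typically used as the text embedding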
