
SGPT: GPT Sentence Embeddings for Semantic Search

This repository contains code, results & pre-trained models for the paper SGPT: GPT Sentence Embeddings for Semantic Search.

**************************** Updates ****************************

  • 2024-02: We released GRIT & GritLM - These models unify SGPT Bi-Encoders, Cross-Encoders, symmetric, asymmetric, and regular GPT (i.e. generation) all in 1 single model at much better performance on all accounts. We recommend switching to these new models :)
  • 2022-09: SGPT Bi-Encoders are now easy to use with Sentence Transformers, see new scripts
  • 2022-08: Multilingual BLOOM SGPT models were released: Asymmetric, 7.1B parameters & Symmetric, 1.7B parameters. Feel free to open an issue if you need a different model.
  • 2022-06: OpenAI released the mechanism of their Search Endpoint that we compared to SGPT Cross-Encoders in the paper. Our methods are very similar. Feel free to test their prompt as seen in crossencoder/beir/openai_search_endpoint_functionality.py!
  • 2022-03: 5.8B Bi-Encoder models are now 4% & 1% better on USEB & BEIR, respectively. Paper & models on HF have been updated. This has been done by using larger batch sizes with GradCache, see the paper for more info. If you have previously downloaded them, we recommend replacing them with the new versions.
  • 2022-02: We released our paper. Check it out! :)

Overview

We present SGPT-BE and SGPT-CE for applying GPT models as Bi-Encoders or Cross-Encoders to symmetric or asymmetric search. SGPT-BE produces semantically meaningful sentence embeddings by contrastive fine-tuning of only bias tensors and position-weighted mean pooling. SGPT-CE uses log probabilities from GPT models without any fine-tuning. An illustration of the methods can be found in other/sgpt_graphic.png.
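For reference, the position-weighted mean pooling mentioned above boils down to a few lines. This is a minimal standalone sketch of the same computation used in the full Hugging Face examples below, assuming last_hidden_state of shape [bs, seq_len, hid_dim] and the tokenizer's attention mask as inputs.

import torch

def weighted_mean_pooling(last_hidden_state, attention_mask):
    # Position weights 1..seq_len: later tokens have attended to more of the
    # sequence under the causal mask, so they get a larger weight
    weights = (
        torch.arange(1, last_hidden_state.shape[1] + 1, device=last_hidden_state.device)
        .unsqueeze(0)
        .unsqueeze(-1)
        .expand(last_hidden_state.size())
        .float()
    )
    # Zero out padding positions
    mask = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
    # Weighted mean across seq_len: [bs, seq_len, hid_dim] -> [bs, hid_dim]
    return torch.sum(last_hidden_state * mask * weights, dim=1) / torch.sum(mask * weights, dim=1)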

Feel free to open an issue should you have any questions~

Structure

.
├── biencoder  # Training & Inference of Bi-Encoders
│   ├── beir
│   │   ├── custommodels # Directory providing BEIR compatibility for asymmetric models & models with special tokens
│   │   │   └── ...
│   │   ├── io_utils # Exclusively used for beir_openai_embeddings_batched_parallel.py
│   │   │   └── ...
│   │   ├── parallelizer # Exclusively used for beir_openai_embeddings_batched_parallel.py
│   │   │   └── ...
│   │   ├── beir_dense_retriever.py
│   │   ├── beir_openai_embeddings_batched_parallel.py
│   │   ├── requirements.txt
│   │   ├── *.bash # Bash scripts to run multiple experiments
│   │   └── README.md
│   ├── nli_msmarco
│   │   ├── sentence-transformers # An adapted version of sentence-transformers - Install this version for all biencoder experiments
│   │   │   └── ...
│   │   └── README.md
│   └── useb
│       ├── useb
│       │   └── ...
│       ├── *.bash # Bash scripts to run multiple experiments
│       ├── useb_dense_retriever.py
│       └── README.md
├── crossencoder  # Inference of Cross-Encoders
│   └── beir
│       ├── *.ipynb # Notebooks explained in the README
│       └── README.md
├── other
│   ├── sgpt_graphic.png
│   └── sgpt_utils.ipynb # Code for creating the graphs in the paper & other
├── requirements.txt
└── README.md

Each data sub-directory provides its own README with an overview of its Structure, Downloads (Datasets, Models) & Commands used to produce the datasets, models & other things. Generally, you can find all models at https://huggingface.co/Muennighoff and JSON results for various datasets at https://www.kaggle.com/muennighoff/datasets. Model names are explained in their Huggingface READMEs. Dataset names are explained in the sub-folders of this repository.

Use SGPT with Huggingface

Below we provide python examples to use the pre-trained models for your own semantic search use case. We highly recommend replacing the model names with larger models, e.g. Muennighoff/SGPT-5.8B-weightedmean-nli-bitfit for biencoder/symmetric.

Bi-Encoder

Symmetric Semantic Search BE

import torch
from transformers import AutoModel, AutoTokenizer
from scipy.spatial.distance import cosine

# Get our models - The package will take care of downloading the models automatically
# For best performance: Muennighoff/SGPT-5.8B-weightedmean-nli-bitfit
tokenizer = AutoTokenizer.from_pretrained("Muennighoff/SGPT-125M-weightedmean-nli-bitfit")
model = AutoModel.from_pretrained("Muennighoff/SGPT-125M-weightedmean-nli-bitfit")
# Deactivate Dropout (There is no dropout in the above models so it makes no difference here but other SGPT models may have dropout)
model.eval()

# Tokenize input texts
texts = [
    "deep learning",
    "artificial intelligence",
    "deep diving",
    "artificial snow",
]
batch_tokens = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Get the embeddings
with torch.no_grad():
    # Get hidden state of shape [bs, seq_len, hid_dim]
    last_hidden_state = model(**batch_tokens, output_hidden_states=True, return_dict=True).last_hidden_state

# Get weights of shape [bs, seq_len, hid_dim]
weights = (
    torch.arange(start=1, end=last_hidden_state.shape[1] + 1)
    .unsqueeze(0)
    .unsqueeze(-1)
    .expand(last_hidden_state.size())
    .float().to(last_hidden_state.device)
)

# Get attn mask of shape [bs, seq_len, hid_dim]
input_mask_expanded = (
    batch_tokens["attention_mask"]
    .unsqueeze(-1)
    .expand(last_hidden_state.size())
    .float()
)

# Perform weighted mean pooling across seq_len: bs, seq_len, hidden_dim -> bs, hidden_dim
sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded * weights, dim=1)
sum_mask = torch.sum(input_mask_expanded * weights, dim=1)

embeddings = sum_embeddings / sum_mask

# Calculate cosine similarities
# Cosine similarities are in [-1, 1]. Higher means more similar
cosine_sim_0_1 = 1 - cosine(embeddings[0], embeddings[1])
cosine_sim_0_2 = 1 - cosine(embeddings[0], embeddings[2])
cosine_sim_0_3 = 1 - cosine(embeddings[0], embeddings[3])

print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[1], cosine_sim_0_1))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[2], cosine_sim_0_2))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[3], cosine_sim_0_3))

Asymmetric Semantic Search BE

import torch
from transformers import AutoModel, AutoTokenizer
from scipy.spatial.distance import cosine

# Get our models - The package will take care of downloading the models automatically
# For best performance: Muennighoff/SGPT-5.8B-weightedmean-msmarco-specb-bitfit
tokenizer = AutoTokenizer.from_pretrained("Muennighoff/SGPT-125M-weightedmean-msmarco-specb-bitfit")
model = AutoModel.from_pretrained("Muennighoff/SGPT-125M-weightedmean-msmarco-specb-bitfit")
# Deactivate Dropout (There is no dropout in the above models so it makes no difference here but other SGPT models may have dropout)
model.eval()

queries = [
    "I'm searching for a planet not too far from Earth.",
]

docs = [
    "Neptune is the eighth and farthest-known Solar planet from the Sun. In the Solar System, it is the fourth-largest planet by diameter, the third-most-massive planet, and the densest giant planet. It is 17 times the mass of Earth, slightly more massive than its near-twin Uranus.",
    "TRAPPIST-1d, also designated as 2MASS J23062928-0502285 d, is a small exoplanet (about 30% the mass of the earth), which orbits on the inner edge of the habitable zone of the ultracool dwarf star TRAPPIST-1 approximately 40 light-years (12.1 parsecs, or nearly 3.7336×1014 km) away from Earth in the constellation of Aquarius.",
    "A harsh desert world orbiting twin suns in the galaxy’s Outer Rim, Tatooine is a lawless place ruled by Hutt gangsters. Many settlers scratch out a living on moisture farms, while spaceport cities such as Mos Eisley and Mos Espa serve as home base for smugglers, criminals, and other rogues.",
]

SPECB_QUE_BOS = tokenizer.encode("[", add_special_tokens=False)[0]
SPECB_QUE_EOS = tokenizer.encode("]", add_special_tokens=False)[0]

SPECB_DOC_BOS = tokenizer.encode("{", add_special_tokens=False)[0]
SPECB_DOC_EOS = tokenizer.encode("}", add_special_tokens=False)[0]


def tokenize_with_specb(texts, is_query):
    # Tokenize without padding
    batch_tokens = tokenizer(texts, padding=False, truncation=True)   
    # Add special brackets & pay attention to them
    for seq, att in zip(batch_tokens["input_ids"], batch_tokens["attention_mask"]):
        if is_query:
            seq.insert(0, SPECB_QUE_BOS)
            seq.append(SPECB_QUE_EOS)
        else:
            seq.insert(0, SPECB_DOC_BOS)
            seq.append(SPECB_DOC_EOS)
        att.insert(0, 1)
        att.append(1)
    # Add padding
    batch_tokens = tokenizer.pad(batch_tokens, padding=True, return_tensors="pt")
    return batch_tokens

def get_weightedmean_embedding(batch_tokens, model):
    # Get the embeddings
    with torch.no_grad():
        # Get hidden state of shape [bs, seq_len, hid_dim]
        last_hidden_state = model(**batch_tokens, output_hidden_states=True, return_dict=True).last_hidden_state

    # Get weights of shape [bs, seq_len, hid_dim]
    weights = (
        torch.arange(start=1, end=last_hidden_state.shape[1] + 1)
        .unsqueeze(0)
        .unsqueeze(-1)
        .expand(last_hidden_state.size())
        .float().to(last_hidden_state.device)
    )

    # Get attn mask of shape [bs, seq_len, hid_dim]
    input_mask_expanded = (
        batch_tokens["attention_mask"]
        .unsqueeze(-1)
        .expand(last_hidden_state.size())
        .float()
    )

    # Perform weighted mean pooling across seq_len: bs, seq_len, hidden_dim -> bs, hidden_dim
    sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded * weights, dim=1)
    sum_mask = torch.sum(input_mask_expanded * weights, dim=1)

    embeddings = sum_embeddings / sum_mask

    return embeddings


query_embeddings = get_weightedmean_embedding(tokenize_with_specb(queries, is_query=True), model)
doc_embeddings = get_weightedmean_embedding(tokenize_with_specb(docs, is_query=False), model)

# Calculate cosine similarities
# Cosine similarities are in [-1, 1]. Higher means more similar
cosine_sim_0_1 = 1 - cosine(query_embeddings[0], doc_embeddings[0])
cosine_sim_0_2 = 1 - cosine(query_embeddings[0], doc_embeddings[1])
cosine_sim_0_3 = 1 - cosine(query_embeddings[0], doc_embeddings[2])

print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (queries[0], docs[0][:20] + "...", cosine_sim_0_1))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (queries[0], docs[1][:20] + "...", cosine_sim_0_2))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (queries[0], docs[2][:20] + "...", cosine_sim_0_3))

Cross-Encoder

Asymmetric Semantic Search CE

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Get models - The package will take care of downloading the models automatically
# For best performance: EleutherAI/gpt-j-6B
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")
# Deactivate Dropout (There is no dropout in the above models so it makes no difference here but other SGPT models may have dropout)
model.eval()

prompt = 'Documents are searched to find matches with the same content.\nThe document "{}" is a good search result for "'

queries = [
    "I'm searching for a planet not too far from Earth.",
]

docs = [
    "Neptune is the eighth and farthest-known Solar planet from the Sun. In the Solar System, it is the fourth-largest planet by diameter, the third-most-massive planet, and the densest giant planet. It is 17 times the mass of Earth, slightly more massive than its near-twin Uranus.",
    "TRAPPIST-1d, also designated as 2MASS J23062928-0502285 d, is a small exoplanet (about 30% the mass of the earth), which orbits on the inner edge of the habitable zone of the ultracool dwarf star TRAPPIST-1 approximately 40 light-years (12.1 parsecs, or nearly 3.7336×1014 km) away from Earth in the constellation of Aquarius.",
    "A harsh desert world orbiting twin suns in the galaxy’s Outer Rim, Tatooine is a lawless place ruled by Hutt gangsters. Many settlers scratch out a living on moisture farms, while spaceport cities such as Mos Eisley and Mos Espa serve as home base for smugglers, criminals, and other rogues.",
]

for query in queries:
    print(f"Query: {query}")
    for doc in docs:
        context = prompt.format(doc)

        context_enc = tokenizer.encode(context, add_special_tokens=False)
        continuation_enc = tokenizer.encode(query, add_special_tokens=False)
        # Slice off the last token, as we take its probability from the one before
        model_input = torch.tensor(context_enc + continuation_enc[:-1]).unsqueeze(0)  # Add a batch dimension
        continuation_len = len(continuation_enc)
        input_len = model_input.shape[1]

        with torch.no_grad():
            # [1, seq_len] -> [seq_len, vocab]
            logprobs = torch.nn.functional.log_softmax(model(model_input).logits[0], dim=-1).cpu()
        # [seq_len, vocab] -> [continuation_len, vocab]
        logprobs = logprobs[input_len-continuation_len:]
        # Gather the log probabilities of the continuation tokens -> [continuation_len]
        logprobs = torch.gather(logprobs, 1, torch.tensor(continuation_enc).unsqueeze(-1)).squeeze(-1)
        score = torch.sum(logprobs)
        # The higher (closer to 0), the more similar
        print(f"Document: {doc[:20] + '...'} Score: {score}")

Symmetric Semantic Search CE

You can use the same code as in the above CE-Asym section but change the prompt. Feel free to share prompts that work well :)
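For illustration, here is a minimal sketch with a hypothetical symmetric prompt. The prompt wording below is an assumption made for this example, not the prompt evaluated in the paper; the scoring loop is the same as in the asymmetric example above.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# For best performance: EleutherAI/gpt-j-6B
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")
model.eval()

# Hypothetical symmetric prompt; adapt the wording to your data
prompt = 'Two sentences are similar if they share the same meaning.\nThe sentence "{}" is similar to "'

anchor = "deep learning"
candidates = ["artificial intelligence", "deep diving", "artificial snow"]

for cand in candidates:
    context = prompt.format(cand)
    context_enc = tokenizer.encode(context, add_special_tokens=False)
    continuation_enc = tokenizer.encode(anchor, add_special_tokens=False)
    # Slice off the last token, as we take its probability from the one before
    model_input = torch.tensor(context_enc + continuation_enc[:-1]).unsqueeze(0)
    continuation_len = len(continuation_enc)
    input_len = model_input.shape[1]

    with torch.no_grad():
        # [1, seq_len] -> [seq_len, vocab]
        logprobs = torch.nn.functional.log_softmax(model(model_input).logits[0], dim=-1).cpu()
    # Keep only the positions predicting the continuation tokens -> [continuation_len, vocab]
    logprobs = logprobs[input_len - continuation_len:]
    # Gather the log probabilities of the continuation tokens -> [continuation_len]
    logprobs = torch.gather(logprobs, 1, torch.tensor(continuation_enc).unsqueeze(-1)).squeeze(-1)
    # The higher (closer to 0), the more similar
    print(f"Candidate: {cand} Score: {torch.sum(logprobs)}")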

Use SGPT with Sentence Transformers

Bi-Encoder ST

Symmetric Semantic Search BE ST

Symmetric models are now 100% compatible with the latest sentence-transformers via pip install git+https://github.com/UKPLab/sentence-transformers.git. You should get the same results as in the HuggingFace script above.

from scipy.spatial.distance import cosine
from sentence_transformers import SentenceTransformer

texts = [
    "deep learning",
    "artificial intelligence",
    "deep diving",
    "artificial snow",
]

model = SentenceTransformer("Muennighoff/SGPT-125M-weightedmean-nli-bitfit")
embeddings = model.encode(texts)

cosine_sim_0_1 = 1 - cosine(embeddings[0], embeddings[1])
cosine_sim_0_2 = 1 - cosine(embeddings[0], embeddings[2])
cosine_sim_0_3 = 1 - cosine(embeddings[0], embeddings[3])

print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[1], cosine_sim_0_1))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[2], cosine_sim_0_2))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[3], cosine_sim_0_3))

Asymmetric Semantic Search BE ST

SGPT Sentence Transformers

Install: pip install --upgrade git+https://github.com/Muennighoff/sentence-transformers.git@sgpt_poolings_specb

Then use the below, which produces the exact same scores as the HuggingFace solution above.

from scipy.spatial.distance import cosine
from sentence_transformers import SentenceTransformer

queries = [
    "I'm searching for a planet not too far from Earth.",
]

docs = [
    "Neptune is the eighth and farthest-known Solar planet from the Sun. In the Solar System, it is the fourth-largest planet by diameter, the third-most-massive planet, and the densest giant planet. It is 17 times the mass of Earth, slightly more massive than its near-twin Uranus.",
    "TRAPPIST-1d, also designated as 2MASS J23062928-0502285 d, is a small exoplanet (about 30% the mass of the earth), which orbits on the inner edge of the habitable zone of the ultracool dwarf star TRAPPIST-1 approximately 40 light-years (12.1 parsecs, or nearly 3.7336×1014 km) away from Earth in the constellation of Aquarius.",
    "A harsh desert world orbiting twin suns in the galaxy’s Outer Rim, Tatooine is a lawless place ruled by Hutt gangsters. Many settlers scratch out a living on moisture farms, while spaceport cities such as Mos Eisley and Mos Espa serve as home base for smugglers, criminals, and other rogues.",
]

class SentenceTransformerSpecb(SentenceTransformer):
    # Requires:
    # pip install git+https://github.com/Muennighoff/sentence-transformers.git@sgpt_poolings_specb
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        tokens = ["[SOS]", "{SOS}"]
        self._first_module().tokenizer.add_tokens(tokens, special_tokens=True)
        self._first_module().auto_model.resize_token_embeddings(len(self._first_module().tokenizer))
        # The [SOS]/{SOS} tokens added in encode() below are replaced with the bracket tokens inside the model.
        # The problem is we don't know if a text is a query or a document when tokenizing in the Transformer.py module,
        # so we use the SOS tokens as an identifier for whether we have a query or a document at hand & then replace them.
        # If we used the brackets directly here, they might become part of another token during tokenization.
        self._first_module().bos_spec_token_q = self._first_module().tokenizer.encode("[SOS]", add_special_tokens=False)[0]
        self._first_module().bos_spec_token_d = self._first_module().tokenizer.encode("{SOS}", add_special_tokens=False)[0]
        self._first_module().bos_spec_token_q_rep = self._first_module().tokenizer.encode("[", add_special_tokens=False)[0]
        self._first_module().eos_spec_token_q = self._first_module().tokenizer.encode("]", add_special_tokens=False)[0]
        self._first_module().bos_spec_token_d_rep = self._first_module().tokenizer.encode("{", add_special_tokens=False)[0]
        self._first_module().eos_spec_token_d = self._first_module().tokenizer.encode("}", add_special_tokens=False)[0]
        self._first_module().replace_bos = True

    def encode(self, sentences, **kwargs):
        is_query = kwargs.pop("is_query", True)
        if is_query:
            sentences = "[SOS]" + sentences if isinstance(sentences, str) else ["[SOS]" + sent for sent in sentences]
        else:
            sentences = "{SOS}" + sentences if isinstance(sentences, str) else ["{SOS}" + sent for sent in sentences]    
        return super().encode(sentences, **kwargs)
        
model = SentenceTransformerSpecb("Muennighoff/SGPT-125M-weightedmean-msmarco-specb-bitfit")

query_embeddings = model.encode(queries, is_query=True)
doc_embeddings = model.encode(docs, is_query=False)

# Calculate cosine similarities
# Cosine similarities are in [-1, 1]. Higher means more similar
cosine_sim_0_1 = 1 - cosine(query_embeddings[0], doc_embeddings[0])
cosine_sim_0_2 = 1 - cosine(query_embeddings[0], doc_embeddings[1])
cosine_sim_0_3 = 1 - cosine(query_embeddings[0], doc_embeddings[2])

print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (queries[0], docs[0][:20] + "...", cosine_sim_0_1))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (queries[0], docs[1][:20] + "...", cosine_sim_0_2))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (queries[0], docs[2][:20] + "...", cosine_sim_0_3))
Original Sentence Transformers

If you want to use the Sentence Transformers at https://github.com/UKPLab/sentence-transformers, you can use the below. Make sure to use the latest version (pip install --upgrade git+https://github.com/UKPLab/sentence-transformers.git). Note that this will produce slightly worse scores than SGPT Sentence Transformers, as the special brackets may get intermingled with other tokens upon tokenization. On SciFact (BEIR) NDCG@10 of the below decreases to 0.566 from 0.569 for SGPT-125M-weightedmean-msmarco-specb-bitfit.

from scipy.spatial.distance import cosine
from sentence_transformers import SentenceTransformer

queries = [
    "I'm searching for a planet not too far from Earth.",
]

docs = [
    "Neptune is the eighth and farthest-known Solar planet from the Sun. In the Solar System, it is the fourth-largest planet by diameter, the third-most-massive planet, and the densest giant planet. It is 17 times the mass of Earth, slightly more massive than its near-twin Uranus.",
    "TRAPPIST-1d, also designated as 2MASS J23062928-0502285 d, is a small exoplanet (about 30% the mass of the earth), which orbits on the inner edge of the habitable zone of the ultracool dwarf star TRAPPIST-1 approximately 40 light-years (12.1 parsecs, or nearly 3.7336×1014 km) away from Earth in the constellation of Aquarius.",
    "A harsh desert world orbiting twin suns in the galaxy’s Outer Rim, Tatooine is a lawless place ruled by Hutt gangsters. Many settlers scratch out a living on moisture farms, while spaceport cities such as Mos Eisley and Mos Espa serve as home base for smugglers, criminals, and other rogues.",
]

class SentenceTransformerSpecb(SentenceTransformer):
    def encode(self, sentences, **kwargs):
        is_query = kwargs.pop("is_query", True)
        if is_query:
            sentences = "[" + sentences + "]" if isinstance(sentences, str) else ["[" + sent + "]" for sent in sentences]
        else:
            sentences = "{" + sentences + "}" if isinstance(sentences, str) else ["{" + sent + "}" for sent in sentences]    
        return super().encode(sentences, **kwargs)
        
model = SentenceTransformerSpecb("Muennighoff/SGPT-125M-weightedmean-msmarco-specb-bitfit")

query_embeddings = model.encode(queries, is_query=True)
doc_embeddings = model.encode(docs, is_query=False)

# Calculate cosine similarities
# Cosine similarities are in [-1, 1]. Higher means more similar
cosine_sim_0_1 = 1 - cosine(query_embeddings[0], doc_embeddings[0])
cosine_sim_0_2 = 1 - cosine(query_embeddings[0], doc_embeddings[1])
cosine_sim_0_3 = 1 - cosine(query_embeddings[0], doc_embeddings[2])

print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (queries[0], docs[0][:20] + "...", cosine_sim_0_1))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (queries[0], docs[1][:20] + "...", cosine_sim_0_2))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (queries[0], docs[2][:20] + "...", cosine_sim_0_3))

Acknowledgements

We thank Constantin Eichenberg and Samuel Weinbach for insightful discussions and valuable feedback throughout the project. We thank Robert Baldock, Marco Bellagente and Koen Oostermeijer for reading drafts of the paper. This work has been supported by OpenAI under the academic access program.

Citation

Feel free to cite our paper if SGPT is helpful to you :)

@article{muennighoff2022sgpt,
  title={SGPT: GPT Sentence Embeddings for Semantic Search},
  author={Muennighoff, Niklas},
  journal={arXiv preprint arXiv:2202.08904},
  year={2022}
}


sgpt's Issues

Training on unlabeled data (German)

Hey, I have a specific domain and unlabeled data. As I see it, SGPT accepts only labeled data. Do you have experience with using artificially created labeled data, for example with the GPL method? Moreover, is there any good pre-trained model variant for German that can be fine-tuned?

Possible to quantize into 4-bit and 8-bit and still use the models

Hi, I was wondering if it's possible to do something like GPTQ quantization into 8 or 4 bit and still be able to use the embeddings from the models.
GPTQ 4-bit models perform quite well compared to fp16 & fp32 in text generation. I wasn't sure if such a thing would work for embeddings.
Any suggestions?
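A possible starting point, purely as a hedged sketch: GPTQ itself has not been evaluated here, but transformers can load the Bi-Encoder weights in 8-bit via bitsandbytes (requires the bitsandbytes and accelerate packages); whether the resulting embeddings stay close to the fp16/fp32 ones would need to be checked empirically, e.g. on BEIR/USEB.

import torch
from transformers import AutoModel, AutoTokenizer

# 8-bit loading is a generic transformers/bitsandbytes feature, not something specific to SGPT
model_name = "Muennighoff/SGPT-5.8B-weightedmean-msmarco-specb-bitfit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
model.eval()

# The position-weighted mean pooling from the Bi-Encoder examples above is then
# applied unchanged to model(**batch_tokens).last_hidden_state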

Theory question about the token weightings for symmetric search.

First things first, I loved reading your paper. It was clear, concise, and has great implications for semantic search going forward. I cannot compliment it highly enough!

One question: I would like to make use of a similar method to get semantic embeddings for non-GPT autoregressive language models. In the paper I read:

Due to the causal attention mask in an auto-regressive decoder transformer, tokens do not attend to future tokens like in an encoder transformer. Hence, only the last token has attended to all tokens in a sequence. To account for this information mismatch, we propose to give later tokens a higher weight using a position-weighted mean pooling method:

$$v = \sum_{i=1}^{S} w_i h_i \quad \text{where} \quad w_i = \frac{i}{\sum_{i=1}^{S} i} \qquad (2)$$

where S is the sequence length, h_i the i-th hidden state and v the query or document embedding. We compare weighted mean pooling with last token pooling, where the hidden state of the final token is the embedding, and regular mean pooling.

This trick is really neat, but I was wondering if this would work for autoregressive decoder-only models that use a causal language modeling loss, for example the XGLM model set? https://huggingface.co/facebook/xglm-564M

How about autoregressive LMs that do not make use of causal language modeling losses, but instead use next-token prediction language modeling, such as the CodeGen model set? https://arxiv.org/pdf/2203.13474.pdf [if you are unfamiliar, the training section is 2.2 :) ]

I understand if there is not a clear answer to these questions, but I would love to hear your thoughts either way.
Thanks again!

Asymmetric search models with longer max seq length?

Hi
Some great work in this repo. I've been trying to get it to work in my asymmetric search application - basically a document retrieval application.
Currently I use one of the sentence transformer models trained by UKPLab, with a max seq length of 512 tokens. But most of my documents are quite a bit longer.
I was wondering if any of the SGPT models that you or anyone else might have trained have a longer max length? Most of what I see on Hugging Face has a max length of 75 or 300.

Thanks

Why use low chunksizes?

Hi!

I saw that you used lower chunk sizes (2-4) when training the models; may I know why? I am sure 40GB of GPU RAM can handle more. Does it give better empirical results?

Thanks!

Evaluating cross encoders

Hi @Muennighoff,
I would like to use your cross encoder with different GPT models.
I have noticed that this script is different from the code in the notebook. Could you explain the difference? Which code should I use, if I want to evaluate cross encoding for different GPT models (e.g. BioGPT)?

Also, do you happen to have the code for running the script in batches, as it is quite slow to predict each query / document pair one by one?
Thanks
Mark

Learning rate & schedule

Hi @Muennighoff

Thanks for your interesting paper and for sharing this repo! You seem to be using different learning rates and different schedules (sometimes with warm-up, sometimes a constant LR).

Did you find these settings empirically? Do they depend on the model size or dataset? Is there any particular LR setting you would recommend when training a GPT model with BitFit?

cannot reproduce leaderboard result

Hello Niklas,
I have a question regarding reproducing SGPT's results. On the MTEB leaderboard, the 125M-weightedmean-msmarco-specb-bitfit model achieves 12.21 NDCG@10 on SCIDOCS. However, I wasn't able to reproduce the result following the instructions here. In my benchmarking, I got a very low number (0.00085). I think the instructions are a bit off.

My second question is that I couldn't really understand the idea behind this block. Looking at how you tokenize queries and corpus, it is much more natural to me to simply wrap query text in [ ] and corpus text in { } before tokenizing them. I got an NDCG of 11.09 when preprocessing SCIDOCS this way, which is much closer to the reported number on the leaderboard.

Need help fine-tuning sgpt model

Hello,

I am interested in using your pre-trained SGPT model for my project, but I am a bit lost when it comes to fine-tuning it for my specific use case (let's say I want to use DBpedia). I was wondering if someone could provide me with a guide or some resources to help me get started.

I would really appreciate any help you can give me. Thank you in advance!

Different model sizes

Great work!
I'm currently trying out some of your models. I just barely can't fit the SGPT 5 billion and SGPT-BLOOM 7 billion models on my RTX 3090.
Thus I have two ideas:

  • a model just a little smaller than 5 billion would fit; it is quite a common GPU, so that would be useful
  • bigger is better... I want to try 16-bit versions of your models; not sure how, but it should be possible, which of course leads to the question:
  • could there be SOTA 20 billion parameter SGPTs I can use in 8-bit mode? I'd guess the lower precision is worth the increased model size performance-wise

Usage for semantic search

Hello,
I want to create semantic search functionality. The model is accurate in calculating the similarity between word synonyms. Here are a few excerpts from the demo:
Suppose my dataset has the following lines:

User: Who is the author of "Romeo and Juliet"?
Model: "Romeo and Juliet" was written by William Shakespeare.

User: Describe the water cycle.
Model: The water cycle is the continuous movement of water on, above, and below the surface of the Earth, involving processes like evaporation, condensation, precipitation, and runoff.

User: Describe the process of DNA replication.
Model: DNA replication is the process of copying DNA to ensure genetic continuity during cell division.

And my prompts:

Prompt 1: rameo and juliet
Output: "Romeo and Juliet" was written by William Shakespeare.

Prompt 2: Tell me about water cycle
Output: The water cycle is the continuous movement of water on, above, and below the surface of the Earth, involving processes like evaporation, condensation, precipitation, and runoff.

Prompt 3: Deoxyribonucleic acid
Output: DNA replication is the process of copying DNA to ensure genetic continuity during cell division.

Prompt 4: what is python language
Output: Photosynthesis is the process by which green plants and some other organisms use sunlight to synthesize foods with the help of chlorophyll pigments.

The confidence value for each of the prompts ranges between 0.25 and 0.4. The issue I am facing is that the model gives the same confidence value for wrong and right outputs, as seen in the Prompt 4 output. I want to show "No results found" if the given word is not in the dataset.

How do I solve this issue and make it more efficient? Thank you in advance.
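One common way to handle this, sketched below under the assumption that the query and document embeddings come from the Bi-Encoder examples above as numpy arrays, is to put a similarity threshold on top of the scores and return "No results found" when nothing clears it. The threshold value here is an arbitrary placeholder and has to be tuned on held-out data.

import numpy as np

def search(query_embedding, doc_embeddings, docs, threshold=0.6):
    # Cosine similarity between the query and every document embedding
    sims = doc_embeddings @ query_embedding / (
        np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
    )
    best = int(np.argmax(sims))
    # Below the threshold, treat even the best hit as a non-match
    if sims[best] < threshold:
        return "No results found"
    return docs[best]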

Error when using sentence_transformer

When I run: model = SentenceTransformer("Muennighoff/SGPT-125M-weightedmean-nli-bitfit")
the following error happens: TypeError: Pooling.__init__() got an unexpected keyword argument 'pooling_mode_weightedmean_tokens'.

My machine:

  • OS: MacOs, chip M1
  • pytorch: 1.12.1
  • sentence_transformer: 2.2.2
  • python: 3.10

ValueError: not enough values to unpack (expected 2, got 1)

Hi,

I'm trying to run the given Cross-Encoder example.
I faced this error at line 41:

   torch.nn.functional.log_softmax(model(model_input)[0], dim=-1).cpu()

I tried modifying dimensions but had no luck; I also tried giving the input as 2 separate parameters, but that did not work either.

Please kindly help me: which parameter should I update to avoid the unpack error?

Thank You and Much Appreciated

 ValueError                                Traceback (most recent call last)
 Cell In [54], line 41
 37 print(input_len)
 38 # print('model',model(model_input))
 39 
 40 # [seq_len] -> [seq_len, vocab]
  ---> 41 logprobs = torch.nn.functional.log_softmax(model(model_input)[0], dim=-1).cpu()
 42 # [seq_len, vocab] -> [continuation_len, vocab]
 43 logprobs = logprobs[input_len-continuation_len:]

 File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/torch/nn/modules/module.py:1130, in Module._call_impl(self, *input, **kwargs)
   1126 # If we don't have any hooks, we want to skip the rest of the logic in
    1127 # this function, and just call forward.
    1128 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
    1129         or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1130     return forward_call(*input, **kwargs)
    1131 # Do not call functions when jit is used
    1132 full_backward_hooks, non_full_backward_hooks = [], []

    File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/transformers/models/gpt_neo/modeling_gpt_neo.py:974, in 
    GPTNeoForCausalLM.forward(self, input_ids, past_key_values, attention_mask, token_type_ids, position_ids, head_mask, 
    inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
    966 r"""
    967 labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
    968     Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set
    969     ``labels = input_ids`` Indices are selected in ``[-100, 0, ..., config.vocab_size]`` All labels set to
    970     ``-100`` are ignored (masked), the loss is only computed for labels in ``[0, ..., config.vocab_size]``
    971 """
    972 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
    --> 974 transformer_outputs = self.transformer(
    975     input_ids,
    976     past_key_values=past_key_values,
    977     attention_mask=attention_mask,
    978     token_type_ids=token_type_ids,
    979     position_ids=position_ids,
    980     head_mask=head_mask,
    981     inputs_embeds=inputs_embeds,
    982     use_cache=use_cache,
    983     output_attentions=output_attentions,
    984     output_hidden_states=output_hidden_states,
    985     return_dict=return_dict,
    986 )
    987 hidden_states = transformer_outputs[0]
    989 lm_logits = self.lm_head(hidden_states)

   File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/torch/nn/modules/module.py:1130, in Module._call_impl(self, 
  *input, **kwargs)
   1126 # If we don't have any hooks, we want to skip the rest of the logic in
    1127 # this function, and just call forward.
    1128 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
     1129         or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1130     return forward_call(*input, **kwargs)
    1131 # Do not call functions when jit is used
    1132 full_backward_hooks, non_full_backward_hooks = [], []

     File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/transformers/models/gpt_neo/modeling_gpt_neo.py:799, in 
     GPTNeoModel.forward(self, input_ids, past_key_values, attention_mask, token_type_ids, position_ids, head_mask, 
     inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
     796     global_attention_mask = None
     798 # Local causal attention mask
     --> 799 batch_size, seq_length = input_shape
     800 full_seq_length = seq_length + past_length
     801 local_attention_mask = GPTNeoAttentionMixin.create_local_attention_mask(
     802     batch_size, full_seq_length, self.config.window_size, device, attention_mask
     803 )

     ValueError: not enough values to unpack (expected 2, got 1)

Use SGPT

Hey 👋

I want to use SGPT on my website as sentence embeddings for semantic search, but I can't understand how I can use both the cross-encoder and the bi-encoder in one piece of code, or can I only use one of them?

Metric scores for CQADupStack

Hey, what an awesome paper! I was looking for some benchmarks for semantic search and I must say it will be really useful in my research :) In this regard, I wanted to ask a question: on this page on Papers with Code: https://paperswithcode.com/sota/information-retrieval-on-cqadupstack you're listed as highest scoring for CQADupStack; however, the metric they list there is mAP@100, and in your paper the metric listed under this score is nDCG@10. Which one is correct? I'm asking as I will be using this dataset to validate implementations of a few models, and I wanted to be sure I'll be comparing the correct metrics.

When I train a bi-encoder using BLOOM 3B, I get this error. What is the cause of this problem, please?

Here is the command line I used:

### https://huggingface.co/docs/accelerate/basic_tutorials/launch

!accelerate launch --multi_gpu --mixed_precision bf16 --num_processes 7 train_bi-encoder_mnrl.py \
--train_batch_size 8  \
--eval_batch_size 8 \
--lr 2e-5  \
--epochs 5 \
--asym \
--pooling weightedmean \
--max_seq_length 512 \
--pooling weightedmean \
--wandbwatchlog gradients \
--specb  \
--freezenonbias  \
--gradcache \
--chunksize 4

Then the following error was reported:

NotImplementedError: Model input split not implemented for type <class 'dict'>
Iteration:   0%|                                       | 0/2743 [00:01<?, ?it/s]
Epoch:   0%|                                              | 0/5 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/root/data_process/train_bi-encoder_mnrl.py", line 375, in <module>
    model.fit(train_objectives=[(train_dataloader, train_loss)],
  File "/root/data_process/sentence_transformers/SentenceTransformer.py", line 801, in fit
    loss_value = loss_model(features, labels)
  File "/root/data_process/sentence_transformers/losses/MultipleNegativesRankingLoss.py", line 153, in __call__
    return super().__call__(*sentence_features, no_sync_except_last=no_sync_except_last)
  File "/data/anaconda3/lib/python3.10/site-packages/grad_cache/grad_cache.py", line 70, in __call__
    return self.cache_step(*args, **kwargs)
  File "/data/anaconda3/lib/python3.10/site-packages/grad_cache/grad_cache.py", line 266, in cache_step
    model_inputs = [self.split_inputs(x, chunk_size) for x, chunk_size in zip(model_inputs, self.chunk_sizes)]
  File "/data/anaconda3/lib/python3.10/site-packages/grad_cache/grad_cache.py", line 266, in <listcomp>
    model_inputs = [self.split_inputs(x, chunk_size) for x, chunk_size in zip(model_inputs, self.chunk_sizes)]
  File "/data/anaconda3/lib/python3.10/site-packages/grad_cache/grad_cache.py", line 102, in split_inputs
    raise NotImplementedError(f'Model input split not implemented for type {type(model_input)}')
NotImplementedError: Model input split not implemented for type <class 'dict'>

But of course, when I set asym to False, it works perfectly. I don't know what the problem is. Can you help me out? Thank you!

the example model of 'SGPT-125M-weightedmean-nli-bitfit' not compatible with ST

Hi, I tried to run the BE ST example with 'Muennighoff/SGPT-125M-weightedmean-nli-bitfit',
but it throws the following exception:
/usr/local/lib/python3.8/site-packages/sentence_transformers/models/Pooling.py:120 in load TypeError: __init__() got an unexpected keyword argument 'pooling_mode_weightedmean_tokens'
Should 'pooling_mode_weightedmean_tokens' in the config.json be changed to 'pooling_mode_mean_tokens'?

Fine-tune Muennighoff/SGPT-2.7B-weightedmean-msmarco-specb-bitfit using TSDAE approach

Hi,

I am trying to fine-tune SGPT-2.7B-weightedmean-msmarco-specb-bitfit with an unlabeled dataset using the TSDAE approach. I am getting this error:

TypeError: forward() got an unexpected keyword argument 'encoder_hidden_states'.

Please help. Thanks!

Stack Trace:

File ~.../sentence_transformers/losses/DenoisingAutoEncoderLoss.py:111, in DenoisingAutoEncoderLoss.forward(self, sentence_features, labels)
108 label_ids = target_features['input_ids'][:, 1:]
110 # Decode
--> 111 decoder_outputs = self.decoder(
112 input_ids=decoder_input_ids,
113 inputs_embeds=None,
114 attention_mask=None,
115 encoder_hidden_states=reps[:, None], # (bsz, hdim) -> (bsz, 1, hdim)
116 encoder_attention_mask=source_features['attention_mask'][:, 0:1],
117 labels=None,
118 return_dict=None,
119 use_cache=False
120 )
122 # Calculate loss
123 lm_logits = decoder_outputs[0]

File .../dist-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []

TypeError: forward() got an unexpected keyword argument 'encoder_hidden_states'

fine-tune sgpt-bloom-7b1-msmarco oom

Hi, I have a problem fine-tuning sgpt-bloom-7b1-msmarco because of an OOM error. Could you please share how you do contrastive fine-tuning on bloom-7b1? (I think distributed training is needed, but I failed...)

Construct SGPT

Hello, I am currently working on my project and I'm interested in your paper.

First of all, is it correct that SGPT is based on GPT-Neo? If yes, is it possible to construct an SGPT model based on GPT-2?
How would I construct it from scratch?

Thank you

OpenAI-GPT3 search endpoint deprecated

Hello there, the search endpoint (purpose) has been deprecated; is there any alternative solution for this? This happens in the CE-BEIR notebook.
Thanks in advance.

Training SGPT for Custom Dataset

Hi, I read your paper and it is cool. I am trying to do this on my own dataset, and my dataset is huge. Can you please tell me the exact way to train from scratch to achieve SGPT, both symmetric and asymmetric, for both encoders? The cross-encoder would be our main interest.
I have one doubt: are you using BERT to produce the cross- and bi-encoder embeddings? In my understanding you are using BERT as an initial pipeline before feeding it to GPT to produce the cosine similarities and log probabilities. Please help.

Can we switch causal attention to full self-attention for SGPT?

If I wanted to generate an embedding for a sentence using a decoder, should it necessarily use causal attention?

eg: This is a sample sentence.

Let's say each word is a token. Now, instead of sending in each token one by one and applying a causal attention mask to get the token embeddings, and then doing position-weighted mean pooling to get the sentence embedding...

Why can't we give the entire sentence all at once and apply a full self-attention mask to get the sentence embedding?

I get that we are trying to stick to the logic followed in the training process, but I am just wondering whether something like this would work.

Can I use multiple GPUs?

I have 2 GPUs, each one has 24G of memory.
When I run the code below:

model = SentenceTransformerSpecb(
    "bigscience/sgpt-bloom-7b1-msmarco",
    cache_folder="/mnt/storage/agtech/modelCache",
)
query_embeddings = model.encode(queries, is_query=True)

I get an OutOfMemoryError; it only uses the first GPU. Can it load the model on two GPUs?

OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 22.03 GiB total capacity; 21.27 GiB
already allocated; 50.94 MiB free; 21.27 GiB reserved in total by PyTorch) If reserved memory is >> allocated
memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and
PYTORCH_CUDA_ALLOC_CONF

If I input more than the max_seq_length?

I see that the sgpt-bloom-7b1-msmarco model has a vector length of 300, but

if I input more than the maximum length, for example more than 400 Chinese characters, it seems that it can still be embedded into a vector, while increasing to more than 500 does not seem to affect the vector calculation results.

Can I enter a maximum of 500 Chinese characters?

HuggingFace script and Sentence Transformers script giving different results

I copy-pasted the two scripts [0][1] into a notebook without any changes. They produce different embeddings and different results.
HF gives:
Cosine similarity between "I'm searching for a planet not too far from Earth." and "Neptune is the eight..." is: 0.622
Cosine similarity between "I'm searching for a planet not too far from Earth." and "TRAPPIST-1d, also de..." is: 0.490
Cosine similarity between "I'm searching for a planet not too far from Earth." and "A harsh desert world..." is: 0.433

Sentence Transformers gives:
Cosine similarity between "I'm searching for a planet not too far from Earth." and "Neptune is the eight..." is: 0.480
Cosine similarity between "I'm searching for a planet not too far from Earth." and "TRAPPIST-1d, also de..." is: 0.370
Cosine similarity between "I'm searching for a planet not too far from Earth." and "A harsh desert world..." is: 0.369

I checked the embeddings; both the doc and the query embeddings are different between the two scripts. I also tried running on GPU (by adding .cuda() in relevant places) - same results as above.

If it helps, I can dump the embedding vectors or the full code in the comments.

It would be nice to have the expected output in the README as well.

[0] https://github.com/Muennighoff/sgpt#asymmetric-semantic-search-be
[1] https://github.com/Muennighoff/sgpt#asymmetric-semantic-search-be-st

accelerate + deepspeed?

Hi, I'm interested in running the example found in biencoder/nli_msmarco/scripts/train_bloom7b1.slurm. Is it possible to execute these using accelerate and deepspeed?
I'm planning to experiment with larger models, hence the need for deepspeed zero3.

Usage for text-2-text-generation

I am relatively new to using Hugging Face models. I found this model and I think that it may work well with my use case, which is to create a bot that can answer questions based on a PDF. I know that I need embeddings to do so. However, I am unsure of how to continue. How can I use the base model from Hugging Face to create an LLM pipeline, or is that not possible with this model?
