
linktransformer's Introduction

LinkTransformer

License: GPL v3 | arXiv

LinkTransformer demo

LinkTransformer is a Python library for merging and deduplicating data frames using language model embeddings. It leverages popular Sentence Transformers models (or any HuggingFace model) to generate embeddings for text data and provides functions to perform efficient 1:1, 1:m, and m:1 merges based on embedding similarity. The package also includes utilities for clustering and data preprocessing, as well as modifications to Sentence Transformers that allow training runs to be logged to Weights & Biases.

More tutorials are coming soon!

News

  • If you are just looking to demo the package on your data, try out this HuggingFace Space to merge two data frames!
  • New (and retrained) models are up on the organization's HuggingFace page (dell-research-harvard), under the same paths. Be sure to check them out. Most get a bump in training data size and better hyperparameters (and loss function choices). Training data for each model is also included within the repo.
  • Online contrastive loss is now available for linkage model training. It is especially useful for training with paired data containing binary labels. Try it out! A tutorial is coming soon.

Features

  • Merge dataframes using language model embeddings
  • Deduplicate data based on similarity threshold
  • Efficient 1:1, 1:m, and m:1 merges
  • Clustering methods for grouping similar data
  • Support for various NLP models available on HuggingFace
  • Classification: prediction and training in one line of code!

Coming soon

  • FAISS GPU, cuDF, cuML and cuGraph integration
  • Integration of other modalities in this framework (Vision/Multimodal models)
  • Hard negative mining for efficient training

Installation

pip install linktransformer

Getting Started

import linktransformer as lt
# Example usage of merge
merged_df = lt.merge(df1, df2, merge_type='1:1', on='key_column', model='your-pretrained-model-from-huggingface')

# Example usage of dedup_rows
deduplicated_df = lt.dedup_rows(df, model='your-pretrained-model', on='text_column', cluster_type='SLINK', cluster_params={'threshold': 0.8})

All transformer-based models from HuggingFace are supported. We recommend sentence-transformers models for these tasks, as they are trained for semantic similarity.
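For a more complete, self-contained sketch, here is a toy end-to-end run. The dataframes are made up, and the sentence-transformers checkpoint is just one reasonable choice; any HuggingFace model path can be substituted.

import pandas as pd
import linktransformer as lt

# Two toy dataframes whose company names are written differently
df1 = pd.DataFrame({"company": ["Apple Inc.", "Micro soft", "Alphabet"]})
df2 = pd.DataFrame({"company": ["apple", "Microsoft Corp", "Google / Alphabet"]})

# 1:1 semantic merge on the "company" column: each left row is paired
# with its nearest right row by embedding similarity
merged = lt.merge(df1, df2, merge_type="1:1", on="company",
                  model="sentence-transformers/all-MiniLM-L6-v2")
print(merged.head())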

Usage

Merging Pandas Dataframes

The merge function is used to merge two dataframes using language model embeddings. It supports three types of merges: 1:1, 1:m, and m:1. The function takes the following parameters:

LinkTransformer demo: merge example

def merge(df1, df2, merge_type='1:1', on=None, model='your-pretrained-model', left_on=None, right_on=None, suffixes=('_x', '_y'),
                 use_gpu=False, batch_size=128, pooling_type='mean', openai_key=None):
    """
    Merge two dataframes using language model embeddings.

    :param df1 (DataFrame): First dataframe (left).
    :param df2 (DataFrame): Second dataframe (right).
    :param merge_type (str): Type of merge to perform (1:m or m:1 or 1:1).
    :param model (str): Language model to use.
    :param on (Union[str, List[str]], optional): Column(s) to join on in both dataframes (used when left_on/right_on are not given). Defaults to None.
    :param left_on (Union[str, List[str]], optional): Column(s) to join on in df1. Defaults to None.
    :param right_on (Union[str, List[str]], optional): Column(s) to join on in df2. Defaults to None.
    :param suffixes (Tuple[str, str]): Suffixes to use for overlapping columns. Defaults to ('_x', '_y').
    :param use_gpu (bool): Whether to use GPU. Not supported yet. Defaults to False.
    :param batch_size (int): Batch size for inferencing embeddings. Defaults to 128.
    :param pooling_type (str): Pooling strategy for embeddings. Defaults to 'mean'.
    :param openai_key (str, optional): OpenAI API key, needed only for OpenAI embedding models. Defaults to None.
    :return: DataFrame: The merged dataframe.
    """

A special case of merging is aggregation (use the function aggregate_rows): the left key is a list of items that need to be aggregated to the right keys. Semantic linking also supports multiple columns as keys in both datasets. For larger datasets, merge_blocking can be used to merge within blocking keys, as sketched below.
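As a rough illustration of blocked merging: the sketch below is hypothetical, since only the merge_blocking function name is documented here; the blocking_vars parameter name is an assumption, so check the package reference for the real signature.

# Hypothetical sketch: "blocking_vars" is an assumed parameter name for
# the exact-match blocking key(s); consult the docs for the real one.
merged = lt.merge_blocking(
    df1, df2,
    merge_type="1:1",
    on="company",
    blocking_vars=["state"],
    model="sentence-transformers/all-MiniLM-L6-v2",
)

Blocking restricts the embedding search to rows that share the blocking key, which keeps the similarity index small on large datasets.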

Clustering or Deduplicating Data

LinkTransformer demo: deduplication example

def dedup_rows(df, model='your-pretrained-model', on='text_column', cluster_type='preferred_cluster_type', cluster_params={'threshold': 0.8}, openai_key=None):
    """
    Deduplicate a dataframe based on a similarity threshold. This is just clustering and keeping the first row in each cluster.
    Refer to the docs for the cluster_rows function for more details.

    :param df (DataFrame): Dataframe to deduplicate.
    :param model (str): Language model to use.
    :param on (Union[str, List[str]]): Column(s) to deduplicate on.
    :param cluster_type (str): Clustering method to use. Defaults to "SLINK".
    :param cluster_params (Dict[str, Any]): Parameters for clustering method. Defaults to {'threshold': 0.5, "min cluster size": 2, "metric": "cosine"}.
    :param openai_key (str): OpenAI API key
    :return: DataFrame: The deduplicated dataframe.
    """

def cluster_rows(df, model='your-pretrained-model', on='text_column', cluster_type='preferred_cluster_type', cluster_params={'threshold': 0.8}, openai_key=None):
    """
    Cluster the rows of a dataframe based on embedding similarity. Various clustering options are supported,
    with the following default parameters:

    {
        "agglomerative": {
            "threshold": 0.5,
            "clustering linkage": "ward",  # You can choose a default linkage method
            "metric": "euclidean",  # You can choose a default metric
        },
        "HDBScan": {
            "min cluster size": 5,
            "min samples": 1,
        },
        "SLINK": {
            "min cluster size": 2,
            "threshold": 0.1,
        },
    }

    :param df (DataFrame): Dataframe to cluster.
    :param model (str): Language model to use.
    :param on (Union[str, List[str]]): Column(s) to cluster on.
    :param cluster_type (str): Clustering method to use. Defaults to "SLINK".
    :param cluster_params (Dict[str, Any]): Parameters for clustering method. Defaults to {'threshold': 0.5, "min cluster size": 2, "metric": "cosine"}.
    :param openai_key (str): OpenAI API key.
    :return: DataFrame: The clustered dataframe, with a "cluster" column recording each row's cluster id.
    """

We provide a simple clustering function to cluster rows based on a key; deduplication just keeps one row per cluster. A usage sketch follows.
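A minimal clustering sketch under the same assumptions (placeholder model and column names); per the library internals quoted in the issues below, cluster assignments are written to a "cluster" column:

clustered = lt.cluster_rows(
    df,
    model="sentence-transformers/all-MiniLM-L6-v2",
    on="vendor_name",
    cluster_type="SLINK",
    cluster_params={"threshold": 0.7, "min cluster size": 2, "metric": "cosine"},
)
print(clustered["cluster"].value_counts().head())  # largest duplicate groups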

Get similarity scores between two sets of columns

def evaluate_pairs(df,model,left_on,right_on,openai_key=None):
    """
    This function evaluates paired columns in a dataframe and gives a match score (cosine similarity).
    Typically, this can be thought of as a way to evaluate already-merged dataframes.

    :param df (DataFrame): Dataframe to evaluate.
    :param model (str): Language model to use.
    :param left_on (Union[str, List[str]]): Column(s) to evaluate on in df.
    :param right_on (Union[str, List[str]]): Reference column(s) to evaluate on in df.
    :param openai_key (str, optional): OpenAI API key, needed only for OpenAI embedding models. Defaults to None.
    :return: DataFrame: The evaluated dataframe.
    """

Training your own LinkTransformer model

def train_model(
    data: Union[str, pd.DataFrame] = None,
    train_data: Union[str, pd.DataFrame] = None,
    val_data: Union[str, pd.DataFrame] = None,
    test_data: Union[str, pd.DataFrame] = None,
    model_path: str="sentence-transformers/paraphrase-xlm-r-multilingual-v1",
    left_col_names: List[str] = ["description47"],
    right_col_names: List[str] = ['description48'],
    left_id_name: List[str] = ['tariffcode47'],
    right_id_name: List[str] = ['tariffcode48'],
    label_col_name: str = None,
    config_path: str = LINKAGE_CONFIG_PATH,
    training_args: dict = {"num_epochs":10},
    log_wandb: bool = False,
) -> str:
    """
    Train the LinkTransformer model.

    :param: model_path (str): The name of the model to use.
    :param: data (str): Path to the dataset in Excel or CSV format or a dataframe object.
    :param: left_col_names (List[str]): List of column names to use as left side data.
    :param: right_col_names (List[str]): List of column names to use as right side data.
    :param: left_id_name (List[str]): List of column names to use as identifiers for the left data.
    :param: right_id_name (List[str]): List of column names to use as identifiers for the right data.
    :param: label_col_name (str): Name of the column to use as labels. Specify this if you have data of the form (left, right, label). This type supports both positive and negative examples.
    :param: clusterid_col_name (str): Name of the column to use as cluster ids. Specify this if you have data of the form (text, cluster_id). 
    :param: cluster_text_col_name (str): Name of the column to use as cluster text. Specify this if you have data of the form (text, cluster_id).
    :param: config_path (str): Path to the JSON configuration file.
    :param: training_args (dict): Dictionary of training arguments to override the config.
    :param: log_wandb (bool): Whether to log the training run on wandb.
    :return: The path to the saved best model.
    """

You can even classify rows of text into predefined classes!

Use pretrained models (ChatGPT or HuggingFace!)

def classify_rows(
    df: DataFrame,
    on: Optional[Union[str, List[str]]] = None,
    model: str = None,
    num_labels: int = 2,
    label_map: Optional[dict] = None,
    use_gpu: bool = False,
    batch_size: int = 128,
    openai_key: Optional[str] = None,
    openai_topic: Optional[str] = None,
    openai_prompt: Optional[str] = None,
    openai_params: Optional[dict] = {}
):
    """
    Classify whether the texts in all rows of one or more columns are relevant to a certain topic. The function uses
    either a trained classifier to make predictions or an OpenAI API key to send requests and retrieve classification
    results from the ChatCompletion endpoint. It returns a copy of the input dataframe with a new column
    "clf_preds_{on}" that stores the classification results.

    :param df: (DataFrame) the dataframe.
    :param on: (Union[str, List[str]], optional) Column(s) to classify (if multiple columns are passed in, they will be joined).
    :param model: (str) filepath to the model to use (to use OpenAI, see "https://platform.openai.com/docs/models").
    :param num_labels: (int) number of labels to predict. Defaults to 2.
    :param label_map: (dict) a dictionary that maps text labels to numeric labels. Used for OpenAI predictions.
    :param use_gpu: (bool) Whether to use GPU. Not supported yet. Defaults to False.
    :param batch_size: (int) Batch size for inferencing embeddings. Defaults to 128.
    :param openai_key: (str, optional) OpenAI API key, required only when using an OpenAI model. Defaults to None.
    :param openai_topic: (str, optional) The topic against which to predict whether each text is relevant. Defaults to None.
    :param openai_prompt: (str, optional) Custom system prompt for the OpenAI ChatCompletion endpoint. Defaults to None.
    :param openai_params: (dict, optional) Custom parameters for the OpenAI ChatCompletion endpoint. Defaults to {}.
    :returns: DataFrame: The dataframe with a new column "clf_preds_{on}" that stores the classification results.
    """

Train your own model!

def train_clf_model(data=None, model="distilroberta-base", on=[], label_col_name="label",
                    train_data=None, val_data=None, test_data=None, data_dir=".",
                    training_args={}, config=CLF_CONFIG_PATH,
                    eval_steps=None, save_steps=None, batch_size=None, lr=None,
                    epochs=None, model_save_dir=".", weighted_loss=False, weight_list=None,
                    wandb_log=False, wandb_name="topic",
                    print_test_mistakes=False):
    """
    Trains a text classification model using Hugging Face's Transformers library.
    
    :param data: (str/DataFrame, optional) Path to the CSV file or a DataFrame object containing the training data.
    :param model: (str, default="distilroberta-base") The name of the Hugging Face model to be used.
    :param on: (list, default=[]) List of column names that are used as input features.
    :param label_col_name: (str, default="label") The column name in the data that contains the labels.
    :param train_data: (str/DataFrame, optional) Training dataset if `data` is not provided.
    :param val_data: (str/DataFrame, optional) Validation dataset if `data` is not provided.
    :param test_data: (str/DataFrame, optional) Test dataset if `data` is not provided.
    :param data_dir: (str, default=".") Directory where training data splits are saved.
    :param training_args: (dict, default={}) Training arguments for the Hugging Face Trainer.
    :param config: (str, default=CLF_CONFIG_PATH) Path to the default config file.
    :param eval_steps: (int, optional) Evaluation interval in terms of steps.
    :param save_steps: (int, optional) Model saving interval in terms of steps.
    :param batch_size: (int, optional) Batch size for training and evaluation.
    :param lr: (float, optional) Learning rate.
    :param epochs: (int, optional) Number of training epochs.
    :param model_save_dir: (str, default=".") Directory where the trained model will be saved.
    :param weighted_loss: (bool, default=False) If true, uses weighted loss based on class frequencies.
    :param weight_list: (list, optional) Weights for each class in the loss function.
    :param wandb_log: (bool, default=False) If true, logs metrics to Weights & Biases.
    :param wandb_name: (str, default="topic") Name of the Weights & Biases project.
    :param print_test_mistakes: (bool, default=False) If true, prints the misclassified samples in the test dataset.
    
    :return: 
        - best_model_path (str): Path to the directory of the best saved model.
        - best_metric (float): The best metric value achieved during training.
        - label_map (dict): Mapping of labels to their respective integer values.
        
    Note:
        Either the `data` parameter or all of `train_data`, `val_data`, and `test_data` should be provided. If only
        `data` is provided, it will be split into train, validation, and test sets.
    """

Contributing

Contributions are welcome! If you encounter any issues or have suggestions for improvement, please create a new issue or submit a pull request.

License

This project is licensed under the GNU General Public License (GPL v3); see the LICENSE file for details.

Acknowledgments

    • The sentence-transformers library and HuggingFace for providing pre-trained NLP models
    • The faiss library for efficient similarity search
    • The sklearn and networkx libraries for clustering and graph operations
    • OpenAI for providing language model embeddings

Roadmap

We will continue to ship feature-rich updates and introduce more modalities, such as images, by supporting vision and multimodal models within this framework, making these tools accessible to users without a technical background.

Package Maintainers

  • Sam Jones (samuelcaronnajones)
  • Abhishek Arora (econabhishek)
  • Yiyang Chen (oooyiyangc)

linktransformer's People

Contributors

dependabot[bot], econabhishek, mikeldt, oooyiyangc


linktransformer's Issues

Fixing `dedup_rows` example

Hey, thanks for sharing this!

Small note: the README.md mentions the following example for calling dedup_rows, which fails:

deduplicated_df = lt.dedup_rows(df, model='your-pretrained-model', on='text_column', threshold=0.8)

I suggest changing it to the one in infer_test:

deduplicated_df = lt.dedup_rows(df, model='your-pretrained-model', on='text_column', cluster_type='preferred_cluster_type', cluster_params= {'threshold': 0.8})

Suggestion to implement range_search

Hi All, again - wonderful package and just terrific work.

One possible extension you might one day consider would be using FAISS's range_search function instead of search (see https://github.com/facebookresearch/faiss/wiki/Special-operations-on-indexes#range-search). This would allow for a "many-to-many" match in the more traditional sense, perhaps aligning the behaviour of the LT package with prior fuzzy matching packages.

The main drawback is that it is not GPU-friendly, but works pretty efficiently on CPUs in my experience.

FWIW, my use-case is to match the universe of job postings to DnB establishments. I use range_search along with your firm-name embeddings to build a dataset with all pairwise matches above a pretty low similarity threshold (0.5). This gives me a huge set of potential matches, and I then use an expectation-maximisation algorithm which considers both similarity scores and other structured covariates (but not necessarily exact matching criteria), like industry codes and location distance, to resolve the best match from this candidate set.

One day I would be happy to help implementing this, if you feel it's something you would want to pursue.

Thanks again for all the great work, it's hugely appreciated by many!

Different merge types results in the same logic

Thanks for such a great repository! I was just going through the code where you run merge() on two data frames, and it looks as if the logic is the same for all three merge types (1:m, m:1, and 1:1):

    ## Add to index depending on merge type
    if merge_type == "1:m":
        index.add(embeddings2)
    elif merge_type == "m:1":
        index.add(embeddings2)
    elif merge_type == "1:1":
        index.add(embeddings2)

    ## Search index
    if merge_type == "1:m":
        D, I = index.search(embeddings1, 1)
    elif merge_type == "m:1":
        D, I = index.search(embeddings1, 1)
    elif merge_type == "1:1":
        D, I = index.search(embeddings1, 1)

Could someone explain why this is the case, please? Thanks! 😄

Parameter "suffixes" does not work for lt.merge()

Hi, I am using the lt.merge() to link 2 dataframes.

merged_df = lt.merge(df1, df2, merge_type='1:1', on='product', model='dell-research-harvard/lt-wikidata-comp-zh', suffixes=('_1', '_2'))

This worked out fine, but the "suffixes" parameter does not seem to work.
The column names are still given the default "_x" and "_y" suffixes.
This is a minor issue though.
Thanks for sharing this great module!

Typo in GitHub pages model zoo

Hi, I just wanted to report a minor typo I noticed on the GitHub pages "model zoo" page.

Under "Company Alias linkage", the link "dell-research-harvard/lt-wikidata-comp-es" says it is English, but this should be Spanish.

I would have submitted a pull request but it appears that the underlying repo for the GitHub page is not public.

Wikipedia Aliases and a Bigger Base Model for Training

Dear LinkTransformer Team,

First off - congrats on a wonderful package. I have three questions:

  1. Are you able to make the underlying Wikipedia aliases dataset available? This looked non-trivial to scrape with the Wikimedia API, so I would love to learn more about this effort!

  2. Are there any plans to train a larger model with these aliases? I would be very open to collaborating on this, as I think there may be a lot of room for improvement by shifting to a slightly larger model (or a huge model that is then quantised heavily). I think the current implementation provides a very robust linguistic measure of distance, but it is still not quite there for true entity resolution based on aliases.

(PS: The canonical example I have in mind is "BofA" to "Bank of America". I am sure this alias was in the training data, but the base model was a bit too parsimonious to keep this bridge).

  3. Do you have any suggestions / thoughts on how best to use the OpenAI embedding models for entity resolution (company names is my specific use case)? My thinking is that appending something like "company name: XXXX" would help a lot with these general-purpose embedding models, but no doubt you have better insight here!

Kind regards,
Peter John Lambert (LSE Econ PhD Student)

KeyError: 'metric' when running cluster_rows

New user of LinkTransformer here. I ran cluster_rows on a dataframe of government contracts with a lot of variation in how vendors are written, and got a KeyError. The dataframe is around 10,000 rows.

Other system info:

Windows 10
Python 3.10.11
linktransformer 0.1.11

Here's the full trace:

KeyError                                  Traceback (most recent call last)
Cell In[111], line 1
----> 1 clusters = lt.cluster_rows(contracts, on = 'vendor_name', cluster_params={'threshold': 0.7}, model = 'all-MiniLM-L6-v2')

File ~\miniconda3\envs\data\lib\site-packages\linktransformer\infer.py:390, in cluster_rows(df, model, on, cluster_type, cluster_params, openai_key)
    388 embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    389 ### Now, cluster the embeddings based on similarity threshold
--> 390 labels = cluster(cluster_type, cluster_params, embeddings, corpus_ids=None)
    391 ### Now, keep only 1 row per cluster
    392 df["cluster"] = labels

File ~\miniconda3\envs\data\lib\site-packages\linktransformer\modified_sbert\cluster_fns.py:61, in cluster(cluster_type, cluster_params, corpus_embeddings, corpus_ids)
     50     clustering_model = AgglomerativeClustering(
     51         n_clusters=None,
     52         distance_threshold=cluster_params["threshold"],
     53         linkage=cluster_params["clustering linkage"],
     54         metric=cluster_params["metric"]
     55     )
     57 if cluster_type == "SLINK":
     58     clustering_model = DBSCAN(
     59         eps=cluster_params["threshold"],
     60         min_samples=cluster_params["min cluster size"],
---> 61         metric=cluster_params["metric"]
     62     )
     64 if cluster_type == "HDBScan":
     65     clustering_model = hdbscan.HDBSCAN(
     66         min_cluster_size=cluster_params["min cluster size"],
     67         min_samples=cluster_params["min samples"],
     68         gen_min_span_tree=True
     69     )

KeyError: 'metric'
