
ssl-relation-prediction's Introduction

Relation Prediction as an Auxiliary Training Objective for Knowledge Graph Completion


This repo contains the code accompanying the paper: “Relation Prediction as an Auxiliary Training Objective for Improving Multi-Relational Graph Representations”. We found that incorporating relation prediction into the 1vsAll objective yields a new self-supervised training objective for knowledge base completion (KBC), which results in significant performance improvements (up to 9.9% in Hits@1) by adding as few as 3–10 lines of code. Unleash the true power of your KBC models with the relation prediction objective!

The codebase also comes with SoTA results on several KBC datasets. Echoing previous research, we find that traditional factorisation-based models, e.g. ComplEx, DistMult and RESCAL, can outperform more recently proposed models when trained appropriately. In most cases, we find the 1vsAll + Relation Prediction objective to be very effective and to require less tweaking than more sophisticated architectures.
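To make the objective concrete, here is a minimal pure-Python sketch of the loss for a single triple. It assumes the ComplEx scoring function and treats each term as a softmax cross-entropy over all candidates; the helper names (`complex_score`, `rp_loss`, `random_embeddings`) are illustrative and are not taken from this codebase, which implements the objective in PyTorch.

```python
import math
import random

def complex_score(e_s, w_p, e_o):
    """ComplEx score: Re(sum_k e_s[k] * w_p[k] * conj(e_o[k]))."""
    return sum((a * b * c.conjugate()).real for a, b, c in zip(e_s, w_p, e_o))

def neg_log_softmax(scores, target):
    """Cross-entropy of the target index against all candidates (1vsAll)."""
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in scores))
    return log_z - scores[target]

def rp_loss(ent, rel, triple, w_rel=1.0):
    """Object and subject prediction, plus weighted relation prediction."""
    s, p, o = triple
    obj_scores = [complex_score(ent[s], rel[p], e) for e in ent]
    sub_scores = [complex_score(e, rel[p], ent[o]) for e in ent]
    rel_scores = [complex_score(ent[s], w, ent[o]) for w in rel]
    return (neg_log_softmax(obj_scores, o)
            + neg_log_softmax(sub_scores, s)
            + w_rel * neg_log_softmax(rel_scores, p))

def random_embeddings(n, rank, rng):
    """Hypothetical helper: n random complex vectors of the given rank."""
    return [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(rank)]
            for _ in range(n)]

rng = random.Random(0)
entities = random_embeddings(5, 4, rng)
relations = random_embeddings(3, 4, rng)
without_rp = rp_loss(entities, relations, (0, 1, 2), w_rel=0.0)
with_rp = rp_loss(entities, relations, (0, 1, 2), w_rel=4.0)
```

Setting `w_rel=0.0` recovers the plain 1vsAll terms, so the relation prediction term is the only difference between the two losses computed at the end.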

Table of Contents

⚡ Link Prediction Results

🧩 Pretrained Embeddings

🧭 How to Use This Repo

🥰 Acknowledgement

📃 Citation

✅ Licence

News

  • 01/02/2022 Pretrained ComplEx embeddings on ogbl-biokg/ogbl-wikikg2 are released. Check them out here
  • 16/12/2021 Pretrained ComplEx embeddings on FB15K-237/WN18RR/CoDEx-M/CoDEx-S are released. Check them out here
  • 01/12/2021 Hyper-parameters on CoDEx, ogbl-biokg and ogbl-wikikg2 are released here

⚡ Link Prediction Results

We attempt to include as many results as possible for recent knowledge graph completion datasets and release the hyper-parameters to foster easy reproduction. Feel free to create an issue if you want to suggest additional datasets for us to include.

Currently, we have results on the OGB link property prediction datasets ogbl-biokg and ogbl-wikikg2, as well as on CoDEx, Aristo-v4, FB15K237, and WN18RR. All training was done on a single 16GB GPU, except for ogbl-wikikg2, which was run on a 32GB GPU.
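The tables below report MRR and Hits@k. As a reminder of what these numbers mean, here is a small sketch that computes both from a list of ranks (rank 1 means the correct entity was scored highest). The function names and the example ranks are illustrative only.

```python
def mrr(ranks):
    """Mean reciprocal rank: average of 1 / rank over all test queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at(ranks, k):
    """Fraction of queries whose correct answer is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Made-up ranks of the correct entity for five queries.
example_ranks = [1, 3, 2, 10, 50]
summary = {
    "MRR": mrr(example_ranks),
    "Hits@1": hits_at(example_ranks, 1),
    "Hits@3": hits_at(example_ranks, 3),
    "Hits@10": hits_at(example_ranks, 10),
}
```

Both metrics lie in [0, 1], and higher is better.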

ogbl-wikikg2

| Model | Params | Using RP? | MRR | Hits@1 | Hits@3 | Hits@10 |
| --- | --- | --- | --- | --- | --- | --- |
| ComplEx (50dim) [1] | 250M | No | 0.3804 | - | - | - |
| ComplEx (250dim) [1] | 1B | No | 0.4027 | - | - | - |
| ComplEx (25dim, ours) | 125M | No | 0.5161 | 0.4576 | 0.5310 | 0.6324 |
| ComplEx (25dim, ours) | 125M | Yes | 0.5192 | 0.4540 | 0.5394 | 0.6483 |
| ComplEx (50dim, ours) | 250M | No | 0.6193 | 0.5503 | 0.6468 | 0.7589 |
| ComplEx (50dim, ours) | 250M | Yes | 0.6392 | 0.5684 | 0.6686 | 0.7822 |
| ComplEx (100dim, ours) | 500M | No | 0.6458 | 0.5750 | 0.6761 | 0.7896 |
| ComplEx (100dim, ours) | 500M | Yes | 0.6509 | 0.5814 | 0.6800 | 0.7923 |

Note that the training of 50/100 dim takes about 3 days and that additional training time will likely lead to better results. We currently use only one 32GB GPU. Acceleration on multiple GPUs will be considered in the future.

ogbl-biokg

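| Model | Params | Using RP? | MRR | Hits@1 | Hits@3 | Hits@10 |
| --- | --- | --- | --- | --- | --- | --- |
| ComplEx [2] | 188M | No | 0.8095 | - | - | - |
| ComplEx (ours) | 188M | No | 0.8482 | 0.7887 | 0.8913 | 0.9536 |
| ComplEx (ours) | 188M | Yes | 0.8494 | 0.7915 | 0.8902 | 0.9540 |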

CoDEx-S

| Model | Using RP? | MRR | Hits@1 | Hits@3 | Hits@10 |
| --- | --- | --- | --- | --- | --- |
| ComplEx [3] | No | 0.465 | 0.372 | 0.504 | 0.646 |
| ComplEx (1000dim, ours) | No | 0.472 | 0.378 | 0.508 | 0.658 |
| ComplEx (1000dim, ours) | Yes | 0.473 | 0.375 | 0.514 | 0.663 |

CoDEx-M

| Model | Using RP? | MRR | Hits@1 | Hits@3 | Hits@10 |
| --- | --- | --- | --- | --- | --- |
| ComplEx [3] | No | 0.337 | 0.262 | 0.370 | 0.476 |
| ComplEx (1000dim, ours) | No | 0.351 | 0.276 | 0.385 | 0.492 |
| ComplEx (1000dim, ours) | Yes | 0.352 | 0.277 | 0.386 | 0.490 |

CoDEx-L

| Model | Using RP? | MRR | Hits@1 | Hits@3 | Hits@10 |
| --- | --- | --- | --- | --- | --- |
| ComplEx [3] | No | 0.294 | 0.237 | 0.318 | 0.400 |
| ComplEx (1000dim, ours) | No | 0.342 | 0.275 | 0.374 | 0.470 |
| ComplEx (1000dim, ours) | Yes | 0.345 | 0.277 | 0.377 | 0.473 |

WN18RR

| Model | Using RP? | MRR | Hits@1 | Hits@3 | Hits@10 |
| --- | --- | --- | --- | --- | --- |
| ComplEx | No | 0.487 | 0.441 | 0.501 | 0.580 |
| ComplEx | Yes | 0.488 | 0.443 | 0.505 | 0.578 |

FB15K237

| Model | Using RP? | MRR | Hits@1 | Hits@3 | Hits@10 |
| --- | --- | --- | --- | --- | --- |
| ComplEx | No | 0.366 | 0.271 | 0.401 | 0.557 |
| ComplEx | Yes | 0.388 | 0.298 | 0.425 | 0.568 |

Aristo-v4

| Model | Using RP? | MRR | Hits@1 | Hits@3 | Hits@10 |
| --- | --- | --- | --- | --- | --- |
| ComplEx | No | 0.301 | 0.232 | 0.324 | 0.438 |
| ComplEx | Yes | 0.311 | 0.240 | 0.336 | 0.447 |

Pretrained Embeddings

| Dataset | #Pred (including reciprocal predicates) | #Ent | Model | Hyper-parameters | Download Link | #Params | File Size |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FB15K-237 | 474 | 14,541 | ComplEx (1000dim, ours) | HPs | Link | 30M | 115MB |
| WN18RR | 22 | 40,943 | ComplEx (1000dim, ours) | HPs | Link | 82M | 313MB |
| CoDEx-M | 102 | 17,050 | ComplEx (1000dim, ours) | HPs | Link | 34M | 131MB |
| CoDEx-S | 84 | 2,034 | ComplEx (1000dim, ours) | HPs | Link | 4M | 17MB |
| ogbl-biokg | 102 | 93,773 | ComplEx (1000dim, ours) | HPs | Link | 188M | 717MB |
| ogbl-wikikg2 | 1,070 | 2,500,604 | ComplEx (50dim, ours) | HPs | Link | 250M | 955MB |
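The #Params and File Size columns can be roughly reproduced from the other columns. The sketch below assumes that only the entity and predicate embeddings are counted, that each ComplEx embedding stores a real and an imaginary part (2 × rank values), and that values are stored as 4-byte floats. These are assumptions made for illustration, not statements about the released files.

```python
def complex_param_count(n_entities, n_predicates, rank):
    """Entity plus predicate embeddings, each with 2 * rank values."""
    return (n_entities + n_predicates) * 2 * rank

def size_in_mib(n_params, bytes_per_param=4):
    """Approximate storage size, assuming float32 parameters."""
    return n_params * bytes_per_param / 2**20

# FB15K-237 row: 14,541 entities, 474 predicates, rank 1000.
fb15k237_params = complex_param_count(14_541, 474, 1000)
fb15k237_mib = size_in_mib(fb15k237_params)

# ogbl-wikikg2 row: 2,500,604 entities, 1,070 predicates, rank 50.
wikikg2_params = complex_param_count(2_500_604, 1_070, 50)
wikikg2_mib = size_in_mib(wikikg2_params)
```

Under these assumptions the FB15K-237 row comes out at about 30M parameters and about 115 MiB, and the ogbl-wikikg2 row at about 250M parameters and about 954 MiB, close to the figures listed in the table.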

Note that we also learn the embeddings for reciprocal predicates as they are reported to be helpful (Dettmers et al., 2018, Lacroix et al., 2018).
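As an illustration of what reciprocal predicates mean for the data, the sketch below adds one inverse triple for every original triple. Encoding the reciprocal of predicate `p` as `p + n_predicates` is an assumed convention chosen for this example; the preprocessing in this repo may index them differently.

```python
def add_reciprocal_triples(triples, n_predicates):
    """For each (s, p, o), also emit (o, p + n_predicates, s)."""
    augmented = list(triples)
    for s, p, o in triples:
        augmented.append((o, p + n_predicates, s))
    return augmented

# Toy example with 237 original predicates, as in FB15K-237.
original = [(0, 1, 2), (5, 236, 7)]
augmented = add_reciprocal_triples(original, n_predicates=237)
n_predicates_with_reciprocals = 2 * 237
```

This doubling is why the #Pred column lists 474 predicates for FB15K-237, which has 237 original relations.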

How to Use This Repo

How to Use This Repo for OGB Datasets

Edit preprocess_datasets.py and specify the dataset you want to run on, either

datasets = ['ogbl-wikikg2']

or

datasets = ['ogbl-biokg']

Then run preprocess_datasets.py as follows

mkdir data/
python preprocess_datasets.py

After preprocessing is complete, a model can be trained by running main.py. For example, to train ComplEx on ogbl-biokg, use the following command

python main.py --dataset ogbl-biokg --model ComplEx --score_rel True --rank 1000 --learning_rate 1e-1 --batch_size 500 --optimizer Adagrad --regularizer N3 --lmbda 0.01 --w_rel 0.25 --valid 1

and to train a ComplEx on ogbl-wikikg2, use the following command on a GPU with 32GB memory

python main.py --dataset ogbl-wikikg2 --model ComplEx --score_rel True --rank 50 --learning_rate 1e-1 --batch_size 250 --optimizer Adagrad --regularizer N3 --lmbda 0.1 --w_rel 0.125 --valid 1

You should obtain training curves similar to the figures below.

[Figures: training curves for ogbl-biokg and ogbl-wikikg2]

How to Use This Repo for Conventional KBC Datasets or Customized Datasets

Prepare Datasets

  • Download the datasets and place them under src_data.
  • Name the file containing training triplets as train, validation triplets as valid and test triplets as test. The folder should look like this
src_data/FB15K-237/train # Tab separated file, each row should be like `head    relation    tail`
src_data/FB15K-237/valid # Tab separated file, each row should be like `head    relation    tail`
src_data/FB15K-237/test # Tab separated file, each row should be like `head    relation    tail`
  • After downloading the datasets, the preprocessing is quick and can be completed within a few minutes. First, edit preprocess_datasets.py and specify the dataset you want to run on, e.g.
datasets = ['custom_graph']

then run

mkdir data/
python preprocess_datasets.py

You can download UMLS, Nations, Kinship, FB15K-237 and WN18RR together from here, and aristo-v4 from here. WN18RR and FB15K-237 can also be downloaded separately.
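Before preprocessing a custom dataset, it can help to check that the files follow the layout described above: three files named train, valid and test (no extension), each row being `head<TAB>relation<TAB>tail`. The following standalone sketch is not part of the repo, and `load_triples` and `check_dataset` are hypothetical helper names.

```python
from pathlib import Path

SPLITS = ("train", "valid", "test")

def load_triples(path):
    """Read tab-separated triples, failing loudly on malformed rows."""
    triples = []
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.rstrip("\r\n")
            if not line:
                continue  # tolerate blank lines
            fields = line.split("\t")
            if len(fields) != 3:
                raise ValueError(
                    f"{path}:{line_no}: expected 3 tab-separated fields, "
                    f"got {len(fields)}"
                )
            triples.append(tuple(fields))
    return triples

def check_dataset(dataset_dir):
    """Count triples per split and distinct entities and relations."""
    entities, relations, counts = set(), set(), {}
    for split in SPLITS:
        triples = load_triples(Path(dataset_dir) / split)
        counts[split] = len(triples)
        for head, relation, tail in triples:
            entities.update((head, tail))
            relations.add(relation)
    return {"triples": counts,
            "entities": len(entities),
            "relations": len(relations)}
```

Calling `check_dataset("src_data/FB15K-237")` would then report the split sizes, or raise an error if a file is missing or a row is not tab-separated.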

Train the model

Use the option score_rel to enable the auxiliary relation prediction objective. Use the option w_rel to set the weight of the relation prediction objective.

For example, the following command trains a ComplEx model with the auxiliary relation prediction objective on FB15K-237

python main.py --dataset FB15K-237 --score_rel True --model ComplEx --rank 1000 --learning_rate 0.1 --batch_size 1000 --lmbda 0.05 --w_rel 4 --max_epochs 100

And the following command trains a ComplEx model without the auxiliary relation prediction objective on FB15K-237

python main.py --dataset FB15K-237 --score_rel False --model ComplEx --rank 1000 --learning_rate 0.1 --batch_size 1000 --lmbda 0.05 --w_rel 4 --max_epochs 100
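If you want to try several weights for the relation prediction objective, the commands can be generated rather than typed by hand. The sketch below only builds and prints command lines using the flags shown above. The grid of `w_rel` values is hypothetical and not a recommendation from the paper.

```python
import shlex

# Flags copied from the FB15K-237 command above.
BASE_FLAGS = {
    "dataset": "FB15K-237",
    "score_rel": "True",
    "model": "ComplEx",
    "rank": 1000,
    "learning_rate": 0.1,
    "batch_size": 1000,
    "lmbda": 0.05,
    "max_epochs": 100,
}

# Hypothetical grid of relation prediction weights to try.
W_REL_GRID = [0.25, 1, 4]

def build_command(w_rel):
    """Return the argv list for one training run with the given w_rel."""
    flags = dict(BASE_FLAGS, w_rel=w_rel)
    command = ["python", "main.py"]
    for name, value in flags.items():
        command += [f"--{name}", str(value)]
    return command

commands = [build_command(w_rel) for w_rel in W_REL_GRID]
for command in commands:
    print(shlex.join(command))
```

Each printed line can be pasted into a shell, or the argv lists can be passed to `subprocess.run` to launch the runs one after another.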

Dependencies

  • pytorch
  • wandb

Acknowledgement

This repo is based on the repo kbc, which provides efficient implementations of 1vsAll for ComplEx and CP. Our repo also includes implementations for other models: TransE, RESCAL, and TuckER.

Citation

If you find this repo useful, please cite us

@inproceedings{
chen2021relation,
title={Relation Prediction as an Auxiliary Training Objective for Improving Multi-Relational Graph Representations},
author={Yihong Chen and Pasquale Minervini and Sebastian Riedel and Pontus Stenetorp},
booktitle={3rd Conference on Automated Knowledge Base Construction},
year={2021},
url={https://openreview.net/forum?id=Qa3uS3H7-Le}
}

License

This repo is CC-BY-NC licensed, as found in the LICENSE file.

Footnotes

  1. The results are taken from the OGB Link Property Prediction Leaderboard on ogbl-wikikg2.

  2. The results are taken from the OGB Link Property Prediction Leaderboard on ogbl-biokg.

  3. The results are taken from the awesome CoDEx repo.


ssl-relation-prediction's Issues

The code fails when running a custom graph

Hello,
I am trying to run the code on a graph I have built that is not included among the choices provided in the repo. The graph I am working with is bipartite with typed edges.

To run the code, I prepared the list of triplets to be split in train, valid and test sets.

Triplets are saved in .tsv files:

...
idx__518	id	2449.0
idx__519	id	2452.0
idx__523	id	2469.0
idx__531	id	2484.0
idx__532	id	2487.0
idx__533	id	2494.0
idx__549	id	2545.0
...

To run my dataset, I slightly modified the python scripts in the repo.

In preprocess_datasets.py I added the line datasets = ['mydata'] to read from the folder src/src_data/mydata. I was then able to run the script, which created and filled the folder data/mydata.

DATA_PATH: /content/ssl-relation-prediction/data
Preparing dataset mydata
2681 entities and 9 relations
creating filtering lists
Done processing!
1

In main.py, I modified the list of datasets by adding mydata so that the code wouldn't raise an exception.

Finally, I tried to run the code with the arguments specified in the readme:

! python src/main.py --dataset mydata --score_rel True --model ComplEx --rank 1000 --learning_rate 0.1 --batch_size 1000 --lmbda 0.05 --w_rel 4 --max_epochs 100

Unfortunately, at this point the code fails because there is no train.npy file in the data folder. I assume that the train.npy file should have been created by the preprocessing script, but for some reason that did not happen. The content of the data folder is the following:

total 1320
-rw-r--r-- 1 root root   40662 Dec  3 12:59 ent_id
-rw-r--r-- 1 root root      97 Dec  3 12:59 rel_id
-rw-r--r-- 1 root root   43023 Dec  3 12:59 test.tsv.pickle
-rw-r--r-- 1 root root 1083297 Dec  3 12:59 to_skip.pickle
-rw-r--r-- 1 root root  140727 Dec  3 12:59 train.tsv.pickle
-rw-r--r-- 1 root root   32391 Dec  3 12:59 valid.tsv.pickle

It's not clear to me how to run custom-made datasets from the readme. Could you help me with that?

Trained entity + relation embeddings

Hi, thanks for your great contribution with this work :-) Do you plan to make the trained entity + relation embeddings for your SOTA models available for download?

Tara

About models validation based on MRR

Hi, I am trying your code and I think there is something wrong with the validation step in the training loop.

That is, consider the line at https://github.com/facebookresearch/ssl-relation-prediction/blob/main/src/engines.py#L197.
At this line, the value assigned to the variable split is actually the last value that was assigned in the previous loop for computing step-wise metrics over all the splits (L.192-195).
The last value is in fact "test".

So I think that at the end the model is wrongly validated against the test set instead of the validation set.
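The Python behaviour this report relies on can be shown in isolation: a `for` loop variable keeps its last value after the loop ends. The sketch below does not reproduce `engines.py`. `fake_eval` and `selection_metric` are hypothetical stand-ins, and the metric values are made up.

```python
SPLITS = ("train", "valid", "test")

def fake_eval(split):
    # Hypothetical stand-in for a per-split evaluation call.
    return {"train": 0.90, "valid": 0.50, "test": 0.45}[split]

results = {}
for split in SPLITS:
    results[split] = fake_eval(split)

# After the loop, `split` still holds the last value it was assigned.
leaked_split = split

def selection_metric(results, selection_split="valid"):
    """Pick the metric for model selection by naming the split explicitly."""
    return results[selection_split]

metric_for_model_selection = selection_metric(results)
```

Naming the split explicitly, as in `selection_metric`, avoids depending on whatever value the loop variable happens to hold.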

Evaluating CP model crashes

Hi Yihong, thank you so much for sharing this awesome repo!!

I tried to run the CP model; training seems to work, but when the validation starts, it crashes:

Evaluate the split train
Evaluating the rhs
Traceback (most recent call last):
  File "/home/jean/kg/src/main.py", line 25, in <module>
    main()
  File "/home/jean/kg/src/main.py", line 20, in main
    engine.episode()
  File "/home/jean/kg/src/engines.py", line 329, in episode
    self.validation_step(epoch=epoch, model=self.model)
  File "/home/jean/kg/src/engines.py", line 260, in validation_step
    res_s = self.dataset.eval(
  File "/home/jean/kg/src/datasets.py", line 243, in eval
    metrics = model.get_metric_ogb(
  File "/home/jean/kg/src/models/__init__.py", line 164, in get_metric_ogb
    cands = self.get_candidates(
TypeError: get_candidates() got an unexpected keyword argument 'indices'

Maybe the get_candidates(...) method of the CP class needs to be re-written?
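A `TypeError` of this kind usually means that a subclass overrides a method with fewer parameters than the caller passes. The sketch below reproduces the pattern with hypothetical classes. Only the `indices` keyword is taken from the traceback above; the other names and signatures are assumptions and do not mirror the repo's code.

```python
class BaseModel:
    def get_candidates(self, chunk_begin, chunk_size, target="rhs", indices=None):
        raise NotImplementedError

    def rank(self, indices):
        # The caller always passes `indices` as a keyword argument.
        return self.get_candidates(0, 2, target="rhs", indices=indices)

class OutdatedModel(BaseModel):
    # Override written before `indices` was added to the interface.
    def get_candidates(self, chunk_begin, chunk_size, target="rhs"):
        return list(range(chunk_begin, chunk_begin + chunk_size))

class UpdatedModel(BaseModel):
    # Override that accepts the new keyword and uses it when given.
    def get_candidates(self, chunk_begin, chunk_size, target="rhs", indices=None):
        if indices is not None:
            return list(indices)
        return list(range(chunk_begin, chunk_begin + chunk_size))

def call_fails(model):
    """Return True if ranking raises the TypeError seen in the traceback."""
    try:
        model.rank([3, 4])
    except TypeError:
        return True
    return False
```

Here `call_fails(OutdatedModel())` is true and `call_fails(UpdatedModel())` is false, which matches the suggestion that the CP override needs the same signature as the one the evaluation code calls.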

1vsAll objective and reciprocal triples

Hi,
I have noticed that the flag --score_lhs is not enabled in your experiments; this flag adds the component $-\log P_\theta(s\mid p,o)$ to the loss. In contrast, the 1vsAll objective includes this conditional likelihood, so there seems to be a discrepancy between the objective function in the paper (where there is a conditioning on the subjects) and the one used here.

Is it because you augment the data set with reciprocal triples? If so, is this equivalent to assuming that $P_\theta(S=s\mid R=p,O=o) = P_\theta(O=s\mid R=p^{-1},S=o)$, where $p^{-1}$ denotes the inverse relation?

Thank you
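The question can be written out side by side. This is only a restatement in the notation of the post: $\mathcal{D}$ (the training set) and the two loss names are symbols introduced here for illustration, and the relation prediction term is left out.

```latex
\begin{align}
\mathcal{L}_{\text{1vsAll}}(\theta)
  &= -\sum_{(s,p,o)\in\mathcal{D}}
     \Big[\log P_\theta(o \mid s, p) + \log P_\theta(s \mid p, o)\Big] \\
\mathcal{L}_{\text{recip}}(\theta)
  &= -\sum_{(s,p,o)\in\mathcal{D}}
     \Big[\log P_\theta(o \mid s, p) + \log P_\theta(O = s \mid R = p^{-1}, S = o)\Big]
\end{align}
```

The second line is what object-only prediction yields once every triple $(s,p,o)$ is accompanied by its reciprocal $(o,p^{-1},s)$. The two losses coincide if $P_\theta(S=s\mid R=p,O=o) = P_\theta(O=s\mid R=p^{-1},S=o)$ holds for every training triple.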

Quick check with the new evaluation code

Hi! This is OGB Team.

We recently released https://github.com/snap-stanford/ogb/releases/tag/1.3.4 which improves the MRR calculation. This should not change your result if your model makes different predictive scores for different triplets, but would penalize those models that give the same predictive scores for different triplets (which is not ideal for link prediction).

Could you update your ogb package to 1.3.4 and run your model again just to confirm everything stays the same (the dataset stays the same. only evaluator changed)? Just one seed should be enough. Thanks!
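To see why tied scores matter for MRR, compare how the rank of the positive triple can be counted. The sketch below uses an averaging convention for ties. This is an assumption made for illustration and not a copy of the OGB evaluator, and the function names are hypothetical.

```python
def optimistic_rank(pos_score, neg_scores):
    """Ties count in favour of the positive triple."""
    return 1 + sum(1 for s in neg_scores if s > pos_score)

def pessimistic_rank(pos_score, neg_scores):
    """Ties count against the positive triple."""
    return 1 + sum(1 for s in neg_scores if s >= pos_score)

def tie_aware_rank(pos_score, neg_scores):
    """Average of the optimistic and pessimistic ranks."""
    return 0.5 * (optimistic_rank(pos_score, neg_scores)
                  + pessimistic_rank(pos_score, neg_scores))

# A degenerate model that gives every triple the same score.
constant_negatives = [0.5, 0.5, 0.5, 0.5]
constant_optimistic = optimistic_rank(0.5, constant_negatives)
constant_tie_aware = tie_aware_rank(0.5, constant_negatives)

# With distinct scores, all three conventions agree.
distinct_negatives = [0.9, 0.3]
distinct_ranks = (optimistic_rank(0.7, distinct_negatives),
                  pessimistic_rank(0.7, distinct_negatives),
                  tie_aware_rank(0.7, distinct_negatives))
```

Under optimistic counting the constant-score model gets a perfect rank of 1, while the tie-aware rank places it in the middle of the candidates. When scores are distinct the choice makes no difference, which is consistent with the expectation stated above that results should stay the same.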

OGB model documentation

Good evening. I have a question regarding your code for the OGB wikidata dataset. I've been working on graph neural networks lately, but because they are new to me I'm having difficulties. Is there documentation for the code somewhere, more specifically for the methods? Thanks in advance.

BUG of Complex and CP: achieve >99% MRR with only one linear layer.

Hi, Yihong

I found a bug when I changed the network of ComplEx and CP. I only added one linear layer for the entity or relation embedding, and then achieved unbelievable results:

Epoch: 0 TRAIN: {'MRR': 1.0, 'hits@[1,3,10]': [1.0, 1.0, 1.0]} VALID: {'MRR': 0.9929838180541992, 'hits@[1,3,10]': [0.9928429126739502, 0.993099570274353, 0.9931280612945557]} TEST: {'MRR': 0.9911285042762756, 'hits@[1,3,10]': [0.9909361600875854, 0.9912049174308777, 0.9914003610610962]}

You can reproduce the results by adding rel = self.fc(rel):

class ComplEx(KBCModel):
    def __init__(
            self, sizes: Tuple[int, int, int], rank: int,
            init_size: float = 1e-3
    ):
        super(ComplEx, self).__init__()
        # Added: linear layer over the 2 * rank relation embedding.
        self.fc = nn.Linear(2 * rank, 2 * rank)

    def score(self, x):
        rel = self.embeddings[1](x[:, 1])
        rel = self.fc(rel)  # added line

    def forward(self, x, score_rhs=True, score_rel=False, score_lhs=False):
        rel = self.embeddings[1](x[:, 1])
        rel = self.fc(rel)  # added line

I suspect there was a data leak, but I couldn't find the reason. Can you provide any ideas?

Looking forward to your reply.

ogbl-biokg

Hi! Thanks for your great work. I recently used your code to prepare the ogbl-biokg dataset and found that the test.pickle and valid.pickle files are both 1.2GB, while train.pickle is only 181MB. I'm not familiar with this dataset, so I don't know whether this is normal.
