facebookresearch / ssl-relation-prediction

Simple yet SoTA Knowledge Graph Embeddings.

License: Other

Python 100.00%

ssl-relation-prediction's Introduction

Relation Prediction as an Auxiliary Training Objective for Knowledge Graph Completion

This repo contains the code accompanying the paper: “Relation Prediction as an Auxiliary Training Objective for Improving Multi-Relational Graph Representations”. We found that incorporating relation prediction into the 1vsAll objective yields a new self-supervised training objective for knowledge base completion (KBC), which results in significant performance improvement (up to 9.9% in Hits@1) by adding as little as 3–10 lines of code. Unleash the true power of your KBC models with the relation prediction objective!
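To make the objective concrete, here is a minimal sketch of the per-triple loss: a 1vsAll object prediction term plus a weighted relation prediction term. It assumes a DistMult-style trilinear scorer and plain Python lists as embeddings; the names `distmult_score`, `log_softmax_at` and `rp_loss` are illustrative and are not taken from this repo. The subject prediction term of 1vsAll is analogous and is left out for brevity.

```python
import math

def log_softmax_at(scores, target):
    """Return log softmax(scores)[target], computed in a numerically stable way."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[target] - log_z

def distmult_score(e_s, w_p, e_o):
    """Trilinear score <e_s, w_p, e_o> (DistMult; an assumption of this sketch)."""
    return sum(a * b * c for a, b, c in zip(e_s, w_p, e_o))

def rp_loss(triple, ent, rel, w_rel=1.0):
    """1vsAll object prediction loss plus w_rel times the relation prediction loss."""
    s, p, o = triple
    # Score the true (s, p) against every candidate object.
    obj_scores = [distmult_score(ent[s], rel[p], e) for e in ent]
    # Score the true (s, o) against every candidate relation.
    rel_scores = [distmult_score(ent[s], r, ent[o]) for r in rel]
    loss_obj = -log_softmax_at(obj_scores, o)
    loss_rel = -log_softmax_at(rel_scores, p)
    return loss_obj + w_rel * loss_rel

# Toy data: 3 entities, 2 relations, 2-dimensional embeddings (made-up values).
ent = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]
rel = [[1.0, 0.5], [-0.5, 1.0]]
print(rp_loss((0, 1, 2), ent, rel, w_rel=0.25))
```

The relation prediction term only adds one extra softmax over the relations, which is consistent with the small code change described above.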

The codebase also comes with SoTA results on several KBC datasets. Echoing previous research, we find that traditional factorisation-based models, e.g. ComplEx, DistMult and RESCAL, can outperform more recently proposed models when trained appropriately. In most cases, we find the 1vsAll + Relation Prediction objective to be very effective and to require less tweaking than more sophisticated architectures.

⚡ Link Prediction Results

🧩 Pretrained Embeddings

🧭 How to Use This Repo

🥰 Acknowledgement

📃 Citation

✅ License

News

01/02/2022 Pretrained ComplEx embeddings on ogbl-biokg/ogbl-wikikg2 are released. Check them out here
16/12/2021 Pretrained ComplEx embeddings on FB15K-237/WN18RR/CoDEx-M/CoDEx-S are released. Check them out here
01/12/2021 Hyper-parameters on CoDEx, ogbl-biokg and ogbl-wikikg2 are released here

⚡ Link Prediction Results

We attempt to include as many results as possible for recent knowledge graph completion datasets and release the hyper-parameters to foster easy reproduction. Feel free to create an issue if you want to suggest additional datasets for us to include.

Currently, we have results on the OGB link property prediction datasets ogbl-biokg and ogbl-wikikg2, as well as on CoDEx, Aristo-v4, FB15K237, and WN18RR. All training was done on a single 16GB GPU, except for ogbl-wikikg2, which was run on a 32GB GPU.
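For readers unfamiliar with the metrics in the tables below, the following sketch computes MRR and Hits@k from a list of ranks, where each rank is the 1-based position of the correct entity among all candidates. The function names and the toy ranks are illustrative and do not come from this repo.

```python
def mrr(ranks):
    """Mean reciprocal rank: the average of 1/rank over all test queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at(ranks, k):
    """Fraction of queries whose correct answer is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Toy ranks for four queries (made-up values).
ranks = [1, 2, 10, 50]
print("MRR:", mrr(ranks))
for k in (1, 3, 10):
    print(f"Hits@{k}:", hits_at(ranks, k))
```

Both metrics lie in [0, 1], and higher is better.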

ogbl-wikikg2

Model	Params	Using RP?	MRR	Hits@1	Hits@3	Hits@10
ComplEx (50dim)¹	250M	No	0.3804	-	-	-
ComplEx (250dim)¹	1B	No	0.4027	-	-	-
ComplEx (25dim, ours)	125M	No	0.5161	0.4576	0.5310	0.6324
ComplEx (25dim, ours)	125M	Yes	0.5192	0.4540	0.5394	0.6483
ComplEx (50dim, ours)	250M	No	0.6193	0.5503	0.6468	0.7589
ComplEx (50dim, ours)	250M	Yes	0.6392	0.5684	0.6686	0.7822
ComplEx (100dim, ours)	500M	No	0.6458	0.5750	0.6761	0.7896
ComplEx (100dim, ours)	500M	Yes	0.6509	0.5814	0.6800	0.7923

Note that training the 50- and 100-dimensional models takes about 3 days, and additional training time will likely lead to better results. We currently use only one 32GB GPU. Acceleration on multiple GPUs will be considered in the future.

ogbl-biokg

Model	Params	Using RP?	MRR	Hits@1	Hits@3	Hits@10
ComplEx ²	188M	No	0.8095	-	-	-
ComplEx (ours)	188M	No	0.8482	0.7887	0.8913	0.9536
ComplEx (ours)	188M	Yes	0.8494	0.7915	0.8902	0.9540

CoDEx-S

Model	Using RP?	MRR	Hits@1	Hits@3	Hits@10
ComplEx ³	No	0.465	0.372	0.504	0.646
ComplEx (1000dim, ours)	No	0.472	0.378	0.508	0.658
ComplEx (1000dim, ours)	Yes	0.473	0.375	0.514	0.663

CoDEx-M

Model	Using RP?	MRR	Hits@1	Hits@3	Hits@10
ComplEx ³	No	0.337	0.262	0.370	0.476
ComplEx (1000dim, ours)	No	0.351	0.276	0.385	0.492
ComplEx (1000dim, ours)	Yes	0.352	0.277	0.386	0.490

CoDEx-L

Model	Using RP?	MRR	Hits@1	Hits@3	Hits@10
ComplEx ³	No	0.294	0.237	0.318	0.400
ComplEx (1000dim, ours)	No	0.342	0.275	0.374	0.470
ComplEx (1000dim, ours)	Yes	0.345	0.277	0.377	0.473

WN18RR

Model	Using RP?	MRR	Hits@1	Hits@3	Hits@10
ComplEx	No	0.487	0.441	0.501	0.580
ComplEx	Yes	0.488	0.443	0.505	0.578

FB15K237

Model	Using RP?	MRR	Hits@1	Hits@3	Hits@10
ComplEx	No	0.366	0.271	0.401	0.557
ComplEx	Yes	0.388	0.298	0.425	0.568

Aristo-v4

Model	Using RP?	MRR	Hits@1	Hits@3	Hits@10
ComplEx	No	0.301	0.232	0.324	0.438
ComplEx	Yes	0.311	0.240	0.336	0.447

Pretrained Embeddings

Dataset	#Pred (including reciprocal predicates)	#Ent	Model	Hyper-parameters	Download Link	#Params	File Size
FB15K-237	474	14,541	ComplEx(1000dim, ours)	HPs	Link	30M	115MB
WN18RR	22	40,943	ComplEx(1000dim, ours)	HPs	Link	82M	313MB
CoDEx-M	102	17,050	ComplEx(1000dim, ours)	HPs	Link	34M	131MB
CoDEx-S	84	2,034	ComplEx(1000dim, ours)	HPs	Link	4M	17MB
ogbl-biokg	102	93,773	ComplEx(1000dim, ours)	HPs	Link	188M	717MB
ogbl-wikikg2	1,070	2,500,604	ComplEx(50dim, ours)	HPs	Link	250M	955MB
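The #Params column can be roughly reproduced from the other columns. The sketch below assumes that the parameters are just the entity and predicate embedding tables, that ComplEx stores a real and an imaginary part per dimension, and that values are saved as 32-bit floats; these are assumptions made for the arithmetic, not statements about the checkpoint format. The function names are illustrative.

```python
def complex_param_count(n_entities, n_predicates, rank):
    """Embedding parameters of ComplEx: (entities + predicates) * 2 * rank."""
    # The factor 2 covers the real and imaginary parts of each dimension.
    return (n_entities + n_predicates) * 2 * rank

def float32_size_mib(n_params):
    """Approximate storage size in MiB, assuming 4 bytes per parameter."""
    return n_params * 4 / 2**20

# FB15K-237 row: 14,541 entities, 474 predicates (incl. reciprocal), rank 1000.
n = complex_param_count(14541, 474, 1000)
print(n, round(float32_size_mib(n)))

# ogbl-wikikg2 row: 2,500,604 entities, 1,070 predicates, rank 50.
print(complex_param_count(2500604, 1070, 50))
```

Under these assumptions the FB15K-237 row comes out at about 30M parameters and about 115 MiB, in line with the table; the listed file sizes of some other rows are slightly larger than this estimate.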

Note that we also learn the embeddings for reciprocal predicates as they are reported to be helpful (Dettmers et al., 2018, Lacroix et al., 2018).
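As an illustration of what reciprocal predicates are, the sketch below adds one inverse triple per original triple. Indexing the inverse of predicate `p` as `p + n_relations` is a common convention and is assumed here; the function name is illustrative and not taken from this repo.

```python
def add_reciprocal_triples(triples, n_relations):
    """Return the triples plus one reciprocal triple (o, p + n_relations, s) each."""
    augmented = list(triples)
    for s, p, o in triples:
        # The reciprocal predicate gets its own embedding, hence a fresh index.
        augmented.append((o, p + n_relations, s))
    return augmented

# FB15K-237 has 237 relations, so predicate ids 237..473 denote the inverses.
print(add_reciprocal_triples([(0, 5, 1)], n_relations=237))
```

This doubling is why the #Pred column above lists 474 predicates for FB15K-237 and 22 for WN18RR.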

How to Use This Repo

How to Use This Repo for OGB Datasets

Edit preprocess_datasets.py and specify the dataset you want to run on, either

datasets = ['ogbl-wikikg2']

or

datasets = ['ogbl-biokg']

Then run preprocess_datasets.py as follows

mkdir data/
python preprocess_datasets.py

After preprocessing is complete, a model can be trained by running main.py. For example, to train ComplEx on ogbl-biokg, use the following command

python main.py --dataset ogbl-biokg --model ComplEx --score_rel True --rank 1000 --learning_rate 1e-1 --batch_size 500 --optimizer Adagrad --regularizer N3 --lmbda 0.01 --w_rel 0.25 --valid 1

and to train ComplEx on ogbl-wikikg2, use the following command on a GPU with 32GB memory

python main.py --dataset ogbl-wikikg2 --model ComplEx --score_rel True --rank 50 --learning_rate 1e-1 --batch_size 250 --optimizer Adagrad --regularizer N3 --lmbda 0.1 --w_rel 0.125 --valid 1

You should obtain training curves similar to the figures below.

[Figures: training curves on ogbl-biokg (left) and ogbl-wikikg2 (right)]

How to Use This Repo for Conventional KBC Datasets or Customized Datasets

Prepare Datasets

Download the datasets and place them under src_data.
Name the files containing the training, validation and test triplets train, valid and test, respectively. The folder should look like this

src_data/FB15K-237/train # Tab separated file, each row should be like `head    relation    tail`
src_data/FB15K-237/valid # Tab separated file, each row should be like `head    relation    tail`
src_data/FB15K-237/test # Tab separated file, each row should be like `head    relation    tail`
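For a customized graph, the layout above can be produced with a few lines of Python. This is a minimal sketch under stated assumptions: the function name `write_splits`, the 80/10/10 split and the random shuffling are choices made for illustration and are not part of this repo. Note that a purely random split may place entities in valid or test that never occur in train.

```python
import os
import random
import tempfile

def write_splits(triples, out_dir, valid_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle (head, relation, tail) triples and write train/valid/test files."""
    triples = list(triples)
    random.Random(seed).shuffle(triples)
    n_valid = int(len(triples) * valid_frac)
    n_test = int(len(triples) * test_frac)
    splits = {
        "valid": triples[:n_valid],
        "test": triples[n_valid:n_valid + n_test],
        "train": triples[n_valid + n_test:],
    }
    os.makedirs(out_dir, exist_ok=True)
    for name, rows in splits.items():
        # File names carry no extension, matching the layout shown above.
        with open(os.path.join(out_dir, name), "w", encoding="utf-8") as f:
            for head, relation, tail in rows:
                f.write(f"{head}\t{relation}\t{tail}\n")
    return {name: len(rows) for name, rows in splits.items()}

# Toy graph written to a temporary folder instead of src_data/ (illustrative only).
toy = [(f"e{i}", "related_to", f"e{i + 1}") for i in range(20)]
out_dir = os.path.join(tempfile.mkdtemp(), "custom_graph")
print(write_splits(toy, out_dir))
```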

After downloading the datasets, the preprocessing is quick and can be completed within a few minutes. First, edit preprocess_datasets.py and specify the dataset you want to run on, e.g.

datasets = ['custom_graph']

then run

mkdir data/
python preprocess_datasets.py

You can download UMLS, Nations, Kinship, FB15K-237 and WN18RR together from here, and Aristo-v4 from here. Some datasets can also be downloaded separately, e.g. WN18RR and FB15K-237.

Train the model

Use the option score_rel to enable the auxiliary relation prediction objective. Use the option w_rel to set the weight of the relation prediction objective.

For example, the following command trains a ComplEx model with the auxiliary relation prediction objective on FB15K-237

python main.py --dataset FB15K-237 --score_rel True --model ComplEx --rank 1000 --learning_rate 0.1 --batch_size 1000 --lmbda 0.05 --w_rel 4 --max_epochs 100

And the following command trains a ComplEx model without the auxiliary relation prediction objective on FB15K-237

python main.py --dataset FB15K-237 --score_rel False --model ComplEx --rank 1000 --learning_rate 0.1 --batch_size 1000 --lmbda 0.05 --w_rel 4 --max_epochs 100
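When comparing several values of w_rel, it can be convenient to generate the command lines programmatically. The sketch below only uses flags that appear in the commands above; the helper `build_command` and the list of w_rel values are illustrative assumptions, and the commands are printed rather than executed.

```python
import shlex

def build_command(hparams, script="main.py"):
    """Turn a dict of hyper-parameters into an argv list for main.py."""
    argv = ["python", script]
    for name, value in hparams.items():
        argv += [f"--{name}", str(value)]
    return argv

base = {
    "dataset": "FB15K-237",
    "model": "ComplEx",
    "rank": 1000,
    "learning_rate": 0.1,
    "batch_size": 1000,
    "lmbda": 0.05,
    "max_epochs": 100,
}

# Example sweep over the relation prediction weight (values chosen for illustration).
for w_rel in (0.25, 1, 4):
    cmd = build_command({**base, "score_rel": True, "w_rel": w_rel})
    print(shlex.join(cmd))
```

The resulting argv lists can be passed to `subprocess.run` if you prefer to launch the runs from Python.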

Dependencies

pytorch
wandb

Acknowledgement

This repo is based on the repo kbc, which provides efficient implementations of 1vsAll for ComplEx and CP. Our repo also includes implementations for other models: TransE, RESCAL, and TuckER.

Citation

If you find this repo useful, please cite us

@inproceedings{
chen2021relation,
title={Relation Prediction as an Auxiliary Training Objective for Improving Multi-Relational Graph Representations},
author={Yihong Chen and Pasquale Minervini and Sebastian Riedel and Pontus Stenetorp},
booktitle={3rd Conference on Automated Knowledge Base Construction},
year={2021},
url={https://openreview.net/forum?id=Qa3uS3H7-Le}
}

License

This repo is CC-BY-NC licensed, as found in the LICENSE file.

¹ The results are taken from the OGB Link Property Prediction Leaderboard on ogbl-wikikg2.
² The results are taken from the OGB Link Property Prediction Leaderboard on ogbl-biokg.
³ The results are taken from the awesome CoDEx repo.


ssl-relation-prediction's Issues

The code fails when running a custom graph

Hello,
I am trying to run the code on a graph I have built that is not included among the choices provided in the repo. The graph I am working with is bipartite with typed edges.

To run the code, I prepared the list of triplets to be split in train, valid and test sets.

Triplets are saved in .tsv files:

...
idx__518	id	2449.0
idx__519	id	2452.0
idx__523	id	2469.0
idx__531	id	2484.0
idx__532	id	2487.0
idx__533	id	2494.0
idx__549	id	2545.0
...

To run my dataset, I slightly modified the python scripts in the repo.

In preprocess_datasets.py I added the line datasets = ['mydata'] to read from the folder src/src_data/mydata. I was then able to run the script, which created and filled the folder data/mydata.

DATA_PATH: /content/ssl-relation-prediction/data
Preparing dataset mydata
2681 entities and 9 relations
creating filtering lists
Done processing!
1

In main.py, I modified the list of datasets by adding mydata so that the code wouldn't raise an exception.

Finally, I tried to run the code with the arguments specified in the readme:

! python src/main.py --dataset mydata --score_rel True --model ComplEx --rank 1000 --learning_rate 0.1 --batch_size 1000 --lmbda 0.05 --w_rel 4 --max_epochs 100

Unfortunately, at this point the code fails because there is no train.npy file in the data folder. I assume that the train.npy file should have been created by the preprocessing script, but for some reason that did not happen. The content of the data folder is the following:

total 1320
-rw-r--r-- 1 root root   40662 Dec  3 12:59 ent_id
-rw-r--r-- 1 root root      97 Dec  3 12:59 rel_id
-rw-r--r-- 1 root root   43023 Dec  3 12:59 test.tsv.pickle
-rw-r--r-- 1 root root 1083297 Dec  3 12:59 to_skip.pickle
-rw-r--r-- 1 root root  140727 Dec  3 12:59 train.tsv.pickle
-rw-r--r-- 1 root root   32391 Dec  3 12:59 valid.tsv.pickle

It's not clear to me how to run custom-made datasets from the readme. Could you help me with that?

Trained entity + relation embeddings

Hi, thanks for your great contribution with this work :-) Do you plan to make the trained entity + relation embeddings for your SOTA models available for download?

Tara

About models validation based on MRR

Hi, I am trying your code and I think there is something wrong with the validation step in the training loop.

That is, consider the line at https://github.com/facebookresearch/ssl-relation-prediction/blob/main/src/engines.py#L197.
At this line, the value assigned to the variable split is actually the last value that was assigned in the previous loop for computing step-wise metrics over all the splits (L.192-195).
The last value is in fact "test".

So I think that at the end the model is wrongly validated against the test set instead of the validation set.
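The scoping behaviour behind this report can be shown in isolation. The snippet below is a generic Python illustration with made-up names, not the code in engines.py: a `for` loop variable stays bound to its last value after the loop ends, so reusing it afterwards silently selects the last split.

```python
def last_split_after_loop(splits=("train", "valid", "test")):
    """Return the value the loop variable holds once the loop has finished."""
    for split in splits:
        pass  # per-split metrics would be computed here
    # `split` is still bound to the last element of `splits`.
    return split

def pick_validation_split(splits=("train", "valid", "test")):
    """Safer pattern: name the split explicitly instead of reusing the loop variable."""
    assert "valid" in splits
    return "valid"

print(last_split_after_loop())
print(pick_validation_split())
```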

Evaluating CP model crashes

Hi Yihong, thank you so much for sharing this awesome repo!!

I tried to run the CP model; training seems to work, but when the validation starts, it crashes:

Evaluate the split train
Evaluating the rhs
Traceback (most recent call last):
  File "/home/jean/kg/src/main.py", line 25, in <module>
    main()
  File "/home/jean/kg/src/main.py", line 20, in main
    engine.episode()
  File "/home/jean/kg/src/engines.py", line 329, in episode
    self.validation_step(epoch=epoch, model=self.model)
  File "/home/jean/kg/src/engines.py", line 260, in validation_step
    res_s = self.dataset.eval(
  File "/home/jean/kg/src/datasets.py", line 243, in eval
    metrics = model.get_metric_ogb(
  File "/home/jean/kg/src/models/__init__.py", line 164, in get_metric_ogb
    cands = self.get_candidates(
TypeError: get_candidates() got an unexpected keyword argument 'indices'

Maybe the get_candidates(...) method of the CP class needs to be re-written?

where can I find the hyperparameters

Hi, where can I find the hyperparameters for the experimental results on each dataset in the Link Prediction Results section?

1vsAll objective and reciprocal triples

Hi,
I have noticed that the flag --score_lhs, which adds the component $-\log P_\theta(s\mid p,o)$ to the loss, is not enabled in your experiments. In contrast, the 1vsAll objective includes this conditional likelihood, so there seems to be a discrepancy between the objective function in the paper (where there is a conditioning on the subjects) and the one used here.

Is it because you augment the data set with reciprocal triples? If so, is this equivalent to assuming that $P_\theta(S=s\mid R=p,O=o) = P_\theta(O=s\mid R=p^{-1},S=o)$, where $p^{-1}$ denotes the inverse relation?
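Written out, the effect of the augmentation on an object-only loss is as follows. Here $\mathcal{D}$ denotes the original training triples and $\mathcal{D}'$ the set obtained by adding one reciprocal triple $(o, p^{-1}, s)$ per original triple; both symbols are notation introduced for this sketch. The first line is plain bookkeeping over the augmented set, while the second line is the modelling assumption being asked about, not something confirmed here.

```latex
\begin{align}
\sum_{(s',p',o') \in \mathcal{D}'} -\log P_\theta(O=o' \mid S=s', R=p')
  &= \sum_{(s,p,o) \in \mathcal{D}}
     \Big[ -\log P_\theta(O=o \mid S=s, R=p)
           -\log P_\theta(O=s \mid S=o, R=p^{-1}) \Big], \\
P_\theta(S=s \mid R=p, O=o)
  &\overset{?}{=} P_\theta(O=s \mid R=p^{-1}, S=o).
\end{align}
```

If the second line holds, the second term in the bracket plays the role of the subject prediction term $-\log P_\theta(s \mid p, o)$.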

Thank you

Quick check with the new evaluation code

Hi! This is OGB Team.

We recently released https://github.com/snap-stanford/ogb/releases/tag/1.3.4 which improves the MRR calculation. This should not change your result if your model makes different predictive scores for different triplets, but would penalize those models that give the same predictive scores for different triplets (which is not ideal for link prediction).

Could you update your ogb package to 1.3.4 and run your model again, just to confirm that everything stays the same (the dataset stays the same; only the evaluator changed)? Just one seed should be enough. Thanks!
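To see why tied scores matter for MRR, the sketch below computes an optimistic rank (ties resolved in favour of the correct triple), a pessimistic rank (ties resolved against it) and their average. Averaging the two is one common tie-aware choice; whether the 1.3.4 evaluator uses exactly this rule is an assumption here, and the function name is illustrative.

```python
def rank_bounds(pos_score, neg_scores):
    """Return (optimistic, pessimistic, averaged) 1-based ranks of the positive."""
    # Optimistic: only strictly better negatives are ranked ahead.
    optimistic = 1 + sum(1 for s in neg_scores if s > pos_score)
    # Pessimistic: tied negatives are ranked ahead as well.
    pessimistic = 1 + sum(1 for s in neg_scores if s >= pos_score)
    return optimistic, pessimistic, 0.5 * (optimistic + pessimistic)

# A model that scores every triple identically: the optimistic rank looks perfect.
print(rank_bounds(0.0, [0.0, 0.0, 0.0, 0.0]))

# A model with distinct scores: all three ranks agree.
print(rank_bounds(0.9, [0.1, 0.5, 0.95]))
```

With distinct scores the three ranks coincide, which matches the statement above that results should not change for models that assign different scores to different triplets.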

OGB model documentation

Good evening. I have a question regarding your code for the ogb wikidata. I've been working on graph neural networks lately, but because it's something new to me I'm having difficulties. Is there documentation for the code somewhere, and more specifically for the methods? Thanks in advance.

BUG of Complex and CP: achieve >99% MRR with only one linear layer.

Hi, Yihong

I found a bug when I changed the network of ComplEx and CP. I only added one linear layer for the entity or relation embedding, and then achieved unbelievable results:

Epoch: 0 TRAIN: {'MRR': 1.0, 'hits@[1,3,10]': [1.0, 1.0, 1.0]} VALID: {'MRR': 0.9929838180541992, 'hits@[1,3,10]': [0.9928429126739502, 0.993099570274353, 0.9931280612945557]} TEST: {'MRR': 0.9911285042762756, 'hits@[1,3,10]': [0.9909361600875854, 0.9912049174308777, 0.9914003610610962]}

You can reproduce the results by adding rel = self.fc(rel):

class ComplEx(KBCModel):
    def __init__(
            self, sizes: Tuple[int, int, int], rank: int,
            init_size: float = 1e-3
    ):
        super(ComplEx, self).__init__()
        self.fc = nn.Linear(2* rank, 2*rank)

    def score(self, x):
        rel = self.embeddings[1](x[:, 1])
        rel = self.fc(rel)

    def forward(self, x, score_rhs=True, score_rel=False, score_lhs=False):
        rel = self.embeddings[1](x[:, 1])
        rel = self.fc(rel)

I suspect there was a data leak, but I couldn't find the reason. Can you provide any ideas?

Looking forward to your reply.

ogbl-biokg

Hi! Thanks for your great work. I recently used your code to prepare the ogbl-biokg dataset and found that the test.pickle and valid.pickle files are both 1.2GB, whereas train.pickle is 181MB. I'm not familiar with this dataset, so I don't know whether this is normal.
