
pykeen's Introduction

PyKEEN


PyKEEN (Python KnowlEdge EmbeddiNgs) is a Python package designed to train and evaluate knowledge graph embedding models (incorporating multi-modal information).


Installation

The latest stable version of PyKEEN requires Python 3.8+. It can be downloaded and installed from PyPI with:

pip install pykeen

The latest version of PyKEEN can be installed directly from the source code on GitHub with:

pip install git+https://github.com/pykeen/pykeen.git

More information about installation (e.g., development mode, Windows installation, Colab, Kaggle, extras) can be found in the installation documentation.

Quickstart

This example shows how to train a model on the training split of a dataset and evaluate it on the held-out test split.

The fastest way to get up and running is to use the pipeline function. It provides a high-level entry into the extensible functionality of this package. The following example shows how to train and evaluate the TransE model on the Nations dataset. By default, the training loop uses the stochastic local closed world assumption (sLCWA) training approach and evaluates with rank-based evaluation.

from pykeen.pipeline import pipeline

result = pipeline(
    model='TransE',
    dataset='nations',
)

The results are returned in an instance of the PipelineResult dataclass that has attributes for the trained model, the training loop, the evaluation, and more. See the tutorials on using your own dataset, understanding the evaluation, and making novel link predictions.

PyKEEN is extensible such that:

  • Each model has the same API, so anything from pykeen.models can be dropped in
  • Each training loop has the same API, so pykeen.training.LCWATrainingLoop can be dropped in
  • Triples factories can be generated by the user with pykeen.triples.TriplesFactory

The full documentation can be found at https://pykeen.readthedocs.io.

Implementation

Below are the models, datasets, training modes, evaluators, and metrics implemented in pykeen.

Datasets

The following 37 datasets are built in to PyKEEN. The citation for each dataset corresponds to either the paper describing the dataset, the first paper published using the dataset with knowledge graph embedding models, or the URL for the dataset if neither of the first two are available. If you want to use a custom dataset, see the Bring Your Own Dataset tutorial. If you have a suggestion for another dataset to include in PyKEEN, please let us know here.

Name Documentation Citation Entities Relations Triples
Aristo-v4 pykeen.datasets.AristoV4 Chen et al., 2021 42016 1593 279425
BioKG pykeen.datasets.BioKG Walsh et al., 2019 105524 17 2067997
Clinical Knowledge Graph pykeen.datasets.CKG Santos et al., 2020 7617419 11 26691525
CN3l Family pykeen.datasets.CN3l Chen et al., 2017 3206 42 21777
CoDEx (large) pykeen.datasets.CoDExLarge Safavi et al., 2020 77951 69 612437
CoDEx (medium) pykeen.datasets.CoDExMedium Safavi et al., 2020 17050 51 206205
CoDEx (small) pykeen.datasets.CoDExSmall Safavi et al., 2020 2034 42 36543
ConceptNet pykeen.datasets.ConceptNet Speer et al., 2017 28370083 50 34074917
Countries pykeen.datasets.Countries Bouchard et al., 2015 271 2 1158
Commonsense Knowledge Graph pykeen.datasets.CSKG Ilievski et al., 2020 2087833 58 4598728
DB100K pykeen.datasets.DB100K Ding et al., 2018 99604 470 697479
DBpedia50 pykeen.datasets.DBpedia50 Shi et al., 2017 24624 351 34421
Drug Repositioning Knowledge Graph pykeen.datasets.DRKG gnn4dr/DRKG 97238 107 5874257
FB15k pykeen.datasets.FB15k Bordes et al., 2013 14951 1345 592213
FB15k-237 pykeen.datasets.FB15k237 Toutanova et al., 2015 14505 237 310079
Global Biotic Interactions pykeen.datasets.Globi Poelen et al., 2014 404207 39 1966385
Hetionet pykeen.datasets.Hetionet Himmelstein et al., 2017 45158 24 2250197
Kinships pykeen.datasets.Kinships Kemp et al., 2006 104 25 10686
Nations pykeen.datasets.Nations ZhenfengLei/KGDatasets 14 55 1992
NationsL pykeen.datasets.NationsLiteral pykeen/pykeen 14 55 1992
OGB BioKG pykeen.datasets.OGBBioKG Hu et al., 2020 93773 51 5088434
OGB WikiKG2 pykeen.datasets.OGBWikiKG2 Hu et al., 2020 2500604 535 17137181
OpenBioLink pykeen.datasets.OpenBioLink Breit et al., 2020 180992 28 4563407
OpenBioLink LQ pykeen.datasets.OpenBioLinkLQ Breit et al., 2020 480876 32 27320889
OpenEA Family pykeen.datasets.OpenEA Sun et al., 2020 15000 248 38265
PharMeBINet pykeen.datasets.PharMeBINet Königs et al., 2022 2869407 208 15883653
PharmKG pykeen.datasets.PharmKG Zheng et al., 2020 188296 39 1093236
PharmKG8k pykeen.datasets.PharmKG8k Zheng et al., 2020 7247 28 485787
PrimeKG pykeen.datasets.PrimeKG Chandak et al., 2022 129375 30 8100498
Unified Medical Language System pykeen.datasets.UMLS ZhenfengLei/KGDatasets 135 46 6529
WD50K (triples) pykeen.datasets.WD50KT Galkin et al., 2020 40107 473 232344
Wikidata5M pykeen.datasets.Wikidata5M Wang et al., 2019 4594149 822 20624239
WK3l-120k Family pykeen.datasets.WK3l120k Chen et al., 2017 119748 3109 1375406
WK3l-15k Family pykeen.datasets.WK3l15k Chen et al., 2017 15126 1841 209041
WordNet-18 pykeen.datasets.WN18 Bordes et al., 2014 40943 18 151442
WordNet-18 (RR) pykeen.datasets.WN18RR Toutanova et al., 2015 40559 11 92583
YAGO3-10 pykeen.datasets.YAGO310 Mahdisoltani et al., 2015 123143 37 1089000

Inductive Datasets

The following 5 inductive datasets are built in to PyKEEN.

Name Documentation Citation
ILPC2022 Large pykeen.datasets.ILPC2022Large Galkin et al., 2022
ILPC2022 Small pykeen.datasets.ILPC2022Small Galkin et al., 2022
FB15k-237 pykeen.datasets.InductiveFB15k237 Teru et al., 2020
NELL pykeen.datasets.InductiveNELL Teru et al., 2020
WordNet-18 (RR) pykeen.datasets.InductiveWN18RR Teru et al., 2020

Representations

The following 20 representations are implemented by PyKEEN.

Name Reference
Backfill pykeen.nn.BackfillRepresentation
Text Encoding pykeen.nn.BiomedicalCURIERepresentation
Combined pykeen.nn.CombinedRepresentation
Embedding pykeen.nn.Embedding
Featurized Message Passing pykeen.nn.FeaturizedMessagePassingRepresentation
Low Rank Embedding pykeen.nn.LowRankRepresentation
NodePiece pykeen.nn.NodePieceRepresentation
Partition pykeen.nn.PartitionRepresentation
R-GCN pykeen.nn.RGCNRepresentation
Simple Message Passing pykeen.nn.SimpleMessagePassingRepresentation
CompGCN pykeen.nn.SingleCompGCNRepresentation
Subset Representation pykeen.nn.SubsetRepresentation
Tensor-Train pykeen.nn.TensorTrainRepresentation
Text Encoding pykeen.nn.TextRepresentation
Tokenization pykeen.nn.TokenizationRepresentation
Transformed pykeen.nn.TransformedRepresentation
Typed Message Passing pykeen.nn.TypedMessagePassingRepresentation
Visual pykeen.nn.VisualRepresentation
Wikidata Text Encoding pykeen.nn.WikidataTextRepresentation
Wikidata Visual pykeen.nn.WikidataVisualRepresentation

Interactions

The following 34 interactions are implemented by PyKEEN.

Name Reference Citation
AutoSF pykeen.nn.AutoSFInteraction Zhang et al., 2020
BoxE pykeen.nn.BoxEInteraction Abboud et al., 2020
ComplEx pykeen.nn.ComplExInteraction Trouillon et al., 2016
ConvE pykeen.nn.ConvEInteraction Dettmers et al., 2018
ConvKB pykeen.nn.ConvKBInteraction Nguyen et al., 2018
Canonical Tensor Decomposition pykeen.nn.CPInteraction Lacroix et al., 2018
CrossE pykeen.nn.CrossEInteraction Zhang et al., 2019
DistMA pykeen.nn.DistMAInteraction Shi et al., 2019
DistMult pykeen.nn.DistMultInteraction Yang et al., 2014
ER-MLP pykeen.nn.ERMLPInteraction Dong et al., 2014
ER-MLP (E) pykeen.nn.ERMLPEInteraction Sharifzadeh et al., 2019
HolE pykeen.nn.HolEInteraction Nickel et al., 2016
KG2E pykeen.nn.KG2EInteraction He et al., 2015
LineaRE pykeen.nn.LineaREInteraction Peng et al., 2020
MultiLinearTucker pykeen.nn.MultiLinearTuckerInteraction Tucker et al., 1966
MuRE pykeen.nn.MuREInteraction Balažević et al., 2019
NTN pykeen.nn.NTNInteraction Socher et al., 2013
PairRE pykeen.nn.PairREInteraction Chao et al., 2020
ProjE pykeen.nn.ProjEInteraction Shi et al., 2017
QuatE pykeen.nn.QuatEInteraction Zhang et al., 2019
RESCAL pykeen.nn.RESCALInteraction Nickel et al., 2011
RotatE pykeen.nn.RotatEInteraction Sun et al., 2019
Structured Embedding pykeen.nn.SEInteraction Bordes et al., 2011
SimplE pykeen.nn.SimplEInteraction Kazemi et al., 2018
TorusE pykeen.nn.TorusEInteraction Ebisu et al., 2018
TransD pykeen.nn.TransDInteraction Ji et al., 2015
TransE pykeen.nn.TransEInteraction Bordes et al., 2013
TransF pykeen.nn.TransFInteraction Feng et al., 2016
Transformer pykeen.nn.TransformerInteraction Galkin et al., 2020
TransH pykeen.nn.TransHInteraction Wang et al., 2014
TransR pykeen.nn.TransRInteraction Lin et al., 2015
TripleRE pykeen.nn.TripleREInteraction Yu et al., 2021
Tucker pykeen.nn.TuckerInteraction Balažević et al., 2019
Unstructured Model pykeen.nn.UMInteraction Bordes et al., 2014

Models

The following 40 models are implemented by PyKEEN.

Name Model Citation
AutoSF pykeen.models.AutoSF Zhang et al., 2020
BoxE pykeen.models.BoxE Abboud et al., 2020
Canonical Tensor Decomposition pykeen.models.CP Lacroix et al., 2018
CompGCN pykeen.models.CompGCN Vashishth et al., 2020
ComplEx pykeen.models.ComplEx Trouillon et al., 2016
ComplEx Literal pykeen.models.ComplExLiteral Kristiadi et al., 2018
ConvE pykeen.models.ConvE Dettmers et al., 2018
ConvKB pykeen.models.ConvKB Nguyen et al., 2018
CooccurrenceFiltered pykeen.models.CooccurrenceFilteredModel Berrendorf et al., 2022
CrossE pykeen.models.CrossE Zhang et al., 2019
DistMA pykeen.models.DistMA Shi et al., 2019
DistMult pykeen.models.DistMult Yang et al., 2014
DistMult Literal pykeen.models.DistMultLiteral Kristiadi et al., 2018
DistMult Literal (Gated) pykeen.models.DistMultLiteralGated Kristiadi et al., 2018
ER-MLP pykeen.models.ERMLP Dong et al., 2014
ER-MLP (E) pykeen.models.ERMLPE Sharifzadeh et al., 2019
Fixed Model pykeen.models.FixedModel Berrendorf et al., 2021
HolE pykeen.models.HolE Nickel et al., 2016
InductiveNodePiece pykeen.models.InductiveNodePiece Galkin et al., 2021
InductiveNodePieceGNN pykeen.models.InductiveNodePieceGNN Galkin et al., 2021
KG2E pykeen.models.KG2E He et al., 2015
MuRE pykeen.models.MuRE Balažević et al., 2019
NTN pykeen.models.NTN Socher et al., 2013
NodePiece pykeen.models.NodePiece Galkin et al., 2021
PairRE pykeen.models.PairRE Chao et al., 2020
ProjE pykeen.models.ProjE Shi et al., 2017
QuatE pykeen.models.QuatE Zhang et al., 2019
R-GCN pykeen.models.RGCN Schlichtkrull et al., 2018
RESCAL pykeen.models.RESCAL Nickel et al., 2011
RotatE pykeen.models.RotatE Sun et al., 2019
SimplE pykeen.models.SimplE Kazemi et al., 2018
Structured Embedding pykeen.models.SE Bordes et al., 2011
TorusE pykeen.models.TorusE Ebisu et al., 2018
TransD pykeen.models.TransD Ji et al., 2015
TransE pykeen.models.TransE Bordes et al., 2013
TransF pykeen.models.TransF Feng et al., 2016
TransH pykeen.models.TransH Wang et al., 2014
TransR pykeen.models.TransR Lin et al., 2015
TuckER pykeen.models.TuckER Balažević et al., 2019
Unstructured Model pykeen.models.UM Bordes et al., 2014

Losses

The following 15 losses are implemented by PyKEEN.

Name Reference Description
Adversarially weighted binary cross entropy (with logits) pykeen.losses.AdversarialBCEWithLogitsLoss An adversarially weighted BCE loss.
Binary cross entropy (after sigmoid) pykeen.losses.BCEAfterSigmoidLoss The numerically unstable version of explicit Sigmoid + BCE loss.
Binary cross entropy (with logits) pykeen.losses.BCEWithLogitsLoss The binary cross entropy loss.
Cross entropy pykeen.losses.CrossEntropyLoss The cross entropy loss that evaluates the cross entropy after softmax output.
Double Margin pykeen.losses.DoubleMarginLoss A limit-based scoring loss, with separate margins for positive and negative elements from [sun2018]_.
Focal pykeen.losses.FocalLoss The focal loss proposed by [lin2018]_.
InfoNCE loss with additive margin pykeen.losses.InfoNCELoss The InfoNCE loss with additive margin proposed by [wang2022]_.
Margin ranking pykeen.losses.MarginRankingLoss The pairwise hinge loss (i.e., margin ranking loss).
Mean squared error pykeen.losses.MSELoss The mean squared error loss.
Self-adversarial negative sampling pykeen.losses.NSSALoss The self-adversarial negative sampling loss function proposed by [sun2019]_.
Pairwise logistic pykeen.losses.PairwiseLogisticLoss The pairwise logistic loss.
Pointwise Hinge pykeen.losses.PointwiseHingeLoss The pointwise hinge loss.
Soft margin ranking pykeen.losses.SoftMarginRankingLoss The soft pairwise hinge loss (i.e., soft margin ranking loss).
Softplus pykeen.losses.SoftplusLoss The pointwise logistic loss (i.e., softplus loss).
Soft Pointwise Hinge pykeen.losses.SoftPointwiseHingeLoss The soft pointwise hinge loss.
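
To make the difference between the pairwise hinge loss and its soft variant concrete, here is a minimal pure-Python sketch. It is an illustration only, not PyKEEN's implementation: the function names are hypothetical, and it assumes that higher scores mean more plausible triples and that the soft variant replaces the hinge with a softplus.

```python
import math


def margin_ranking_loss(pos_score: float, neg_score: float, margin: float = 1.0) -> float:
    """Pairwise hinge loss: zero once the positive beats the negative by `margin`."""
    return max(0.0, margin + neg_score - pos_score)


def soft_margin_ranking_loss(pos_score: float, neg_score: float, margin: float = 1.0) -> float:
    """Soft variant: max(0, x) is replaced by softplus(x) = log(1 + exp(x))."""
    return math.log1p(math.exp(margin + neg_score - pos_score))


# The hinge is exactly zero for a well-separated pair; the soft variant is small but positive.
hard = margin_ranking_loss(pos_score=3.0, neg_score=1.0)
soft = soft_margin_ranking_loss(pos_score=3.0, neg_score=1.0)
```

The soft variant never reaches zero, so well-separated pairs still contribute a small gradient.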

Regularizers

The following 6 regularizers are implemented by PyKEEN.

Name Reference Description
combined pykeen.regularizers.CombinedRegularizer A convex combination of regularizers.
lp pykeen.regularizers.LpRegularizer A simple L_p norm based regularizer.
no pykeen.regularizers.NoRegularizer A regularizer which does not perform any regularization.
normlimit pykeen.regularizers.NormLimitRegularizer A regularizer which formulates a soft constraint on a maximum norm.
orthogonality pykeen.regularizers.OrthogonalityRegularizer A regularizer for the soft orthogonality constraints from [wang2014]_.
powersum pykeen.regularizers.PowerSumRegularizer A simple x^p based regularizer.

Training Loops

The following 3 training loops are implemented in PyKEEN.

Name Reference Description
lcwa pykeen.training.LCWATrainingLoop A training loop that is based upon the local closed world assumption (LCWA).
slcwa pykeen.training.SLCWATrainingLoop A training loop that uses the stochastic local closed world assumption training approach.
symmetriclcwa pykeen.training.SymmetricLCWATrainingLoop A "symmetric" LCWA scoring heads and tails at once.

Negative Samplers

The following 3 negative samplers are implemented in PyKEEN.

Name Reference Description
basic pykeen.sampling.BasicNegativeSampler A basic negative sampler.
bernoulli pykeen.sampling.BernoulliNegativeSampler An implementation of the Bernoulli negative sampling approach proposed by [wang2014]_.
pseudotyped pykeen.sampling.PseudoTypedNegativeSampler A sampler that accounts for which entities co-occur with a relation.
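
The idea behind a basic negative sampler can be sketched in a few lines of plain Python. This is a simplified illustration, not the code in pykeen.sampling: the function name and arguments are hypothetical, triples are assumed to be integer ID tuples, and no filtering of accidentally true triples is performed.

```python
import random


def corrupt_triples(triples, num_entities, num_negs_per_pos=1, seed=0):
    """Create negatives by replacing the head or the tail of each positive triple.

    Triples are (head_id, relation_id, tail_id) tuples. The replacement entity is
    drawn uniformly, so a negative may coincide with a true triple (unfiltered).
    """
    rng = random.Random(seed)
    negatives = []
    for head, relation, tail in triples:
        for _ in range(num_negs_per_pos):
            replacement = rng.randrange(num_entities)
            if rng.random() < 0.5:
                negatives.append((replacement, relation, tail))  # corrupt the head
            else:
                negatives.append((head, relation, replacement))  # corrupt the tail
    return negatives


positives = [(0, 0, 1), (1, 1, 2)]
negatives = corrupt_triples(positives, num_entities=3, num_negs_per_pos=2)
```

Each positive yields `num_negs_per_pos` negatives that keep the relation and one of the two entities.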

Stoppers

The following 2 stoppers are implemented in PyKEEN.

Name Reference Description
early pykeen.stoppers.EarlyStopper A harness for early stopping.
nop pykeen.stoppers.NopStopper A stopper that does nothing.

Evaluators

The following 5 evaluators are implemented in PyKEEN.

Name Reference Description
classification pykeen.evaluation.ClassificationEvaluator An evaluator that uses classification metrics.
macrorankbased pykeen.evaluation.MacroRankBasedEvaluator Macro-average rank-based evaluation.
ogb pykeen.evaluation.OGBEvaluator A sampled, rank-based evaluator that applies a custom OGB evaluation.
rankbased pykeen.evaluation.RankBasedEvaluator A rank-based evaluator for KGE models.
sampledrankbased pykeen.evaluation.SampledRankBasedEvaluator A rank-based evaluator using sampled negatives instead of all negatives.

Metrics

The following 44 metrics are implemented in PyKEEN.

Name Interval Direction Description Type
Accuracy $[0, 1]$ 📈 The ratio of the number of correct classifications to the total number. Classification
Area Under The Receiver Operating Characteristic Curve $[0, 1]$ 📈 The area under the receiver operating characteristic curve. Classification
Average Precision Score $[0, 1]$ 📈 The average precision across different thresholds. Classification
Balanced Accuracy Score $[0, 1]$ 📈 The average of recall obtained on each class. Classification
Diagnostic Odds Ratio $[0, ∞)$ 📈 The ratio of positive and negative likelihood ratio. Classification
F1 Score $[0, 1]$ 📈 The harmonic mean of precision and recall. Classification
False Discovery Rate $[0, 1]$ 📉 The proportion of predicted positives which are true negatives. Classification
False Negative Rate $[0, 1]$ 📉 The probability that a truly positive triple is predicted negative. Classification
False Omission Rate $[0, 1]$ 📉 The proportion of predicted negatives which are true positives. Classification
False Positive Rate $[0, 1]$ 📉 The probability that a truly negative triple is predicted positive. Classification
Fowlkes Mallows Index $[0, 1]$ 📈 The Fowlkes Mallows index. Classification
Informedness $[-1, 1]$ 📈 The informedness metric. Classification
Matthews Correlation Coefficient $[-1, 1]$ 📈 The Matthews Correlation Coefficient (MCC). Classification
Negative Likelihood Ratio $[0, ∞)$ 📉 The ratio of false negative rate to true negative rate. Classification
Negative Predictive Value $[0, 1]$ 📈 The proportion of predicted negatives which are true negatives. Classification
Number of Scores $[0, ∞)$ 📈 The number of scores. Classification
Positive Likelihood Ratio $[0, ∞)$ 📈 The ratio of true positive rate to false positive rate. Classification
Positive Predictive Value $[0, 1]$ 📈 The proportion of predicted positives which are true positive. Classification
Prevalence Threshold $[0, ∞)$ 📉 The prevalence threshold. Classification
Threat Score $[0, 1]$ 📈 The ratio of true positives to the sum of true positives, false negatives, and false positives. Classification
True Negative Rate $[0, 1]$ 📈 The probability that a truly negative triple is predicted negative. Classification
True Positive Rate $[0, 1]$ 📈 The probability that a truly positive triple is predicted positive. Classification
Adjusted Arithmetic Mean Rank (AAMR) $[0, 2)$ 📉 The mean over all ranks divided by its expected value. Ranking
Adjusted Arithmetic Mean Rank Index (AAMRI) $[-1, 1]$ 📈 The re-indexed adjusted mean rank (AAMR) Ranking
Adjusted Geometric Mean Rank Index (AGMRI) $(\frac{-E[f]}{1-E[f]}, 1]$ 📈 The re-indexed adjusted geometric mean rank (AGMRI) Ranking
Adjusted Hits at K $(\frac{-E[f]}{1-E[f]}, 1]$ 📈 The re-indexed adjusted hits at K Ranking
Adjusted Inverse Harmonic Mean Rank $(\frac{-E[f]}{1-E[f]}, 1]$ 📈 The re-indexed adjusted MRR Ranking
Geometric Mean Rank (GMR) $[1, ∞)$ 📉 The geometric mean over all ranks. Ranking
Harmonic Mean Rank (HMR) $[1, ∞)$ 📉 The harmonic mean over all ranks. Ranking
Hits @ K $[0, 1]$ 📈 The relative frequency of ranks not larger than a given k. Ranking
Inverse Arithmetic Mean Rank (IAMR) $(0, 1]$ 📈 The inverse of the arithmetic mean over all ranks. Ranking
Inverse Geometric Mean Rank (IGMR) $(0, 1]$ 📈 The inverse of the geometric mean over all ranks. Ranking
Inverse Median Rank $(0, 1]$ 📈 The inverse of the median over all ranks. Ranking
Mean Rank (MR) $[1, ∞)$ 📉 The arithmetic mean over all ranks. Ranking
Mean Reciprocal Rank (MRR) $(0, 1]$ 📈 The inverse of the harmonic mean over all ranks. Ranking
Median Rank $[1, ∞)$ 📉 The median over all ranks. Ranking
z-Geometric Mean Rank (zGMR) $(-∞, ∞)$ 📈 The z-scored geometric mean rank Ranking
z-Hits at K $(-∞, ∞)$ 📈 The z-scored hits at K Ranking
z-Mean Rank (zMR) $(-∞, ∞)$ 📈 The z-scored mean rank Ranking
z-Mean Reciprocal Rank (zMRR) $(-∞, ∞)$ 📈 The z-scored mean reciprocal rank Ranking
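
As a rough illustration of how several of the ranking metrics in the table relate to each other, the sketch below computes them from a plain list of 1-based ranks. It is not PyKEEN's evaluator: the function names are hypothetical, and the adjusted mean rank assumes every ranking has the same number of candidates, so the expected rank of a random model is (n + 1) / 2.

```python
from statistics import mean, median


def rank_metrics(ranks, ks=(1, 3, 10)):
    """Compute a few rank-based metrics from a list of 1-based ranks."""
    results = {
        "mean_rank": mean(ranks),
        "median_rank": median(ranks),
        # MRR is the arithmetic mean of reciprocal ranks,
        # i.e. the inverse of the harmonic mean rank.
        "mean_reciprocal_rank": mean([1.0 / r for r in ranks]),
    }
    for k in ks:
        # Hits@k: relative frequency of ranks not larger than k.
        results[f"hits_at_{k}"] = sum(r <= k for r in ranks) / len(ranks)
    return results


def adjusted_mean_rank(ranks, num_candidates):
    """Mean rank divided by its expected value under uniformly random ranking."""
    return mean(ranks) / ((num_candidates + 1) / 2)


metrics = rank_metrics([1, 2, 4, 10])
amr = adjusted_mean_rank([1, 2, 4, 10], num_candidates=9)
```

An adjusted mean rank below 1 means the model ranks true triples better than random ordering would on average.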

Trackers

The following 8 trackers are implemented in PyKEEN.

Name Reference Description
console pykeen.trackers.ConsoleResultTracker A class that directly prints to console.
csv pykeen.trackers.CSVResultTracker Tracking results to a CSV file.
json pykeen.trackers.JSONResultTracker Tracking results to a JSON lines file.
mlflow pykeen.trackers.MLFlowResultTracker A tracker for MLflow.
neptune pykeen.trackers.NeptuneResultTracker A tracker for Neptune.ai.
python pykeen.trackers.PythonResultTracker A tracker which stores everything in Python dictionaries.
tensorboard pykeen.trackers.TensorBoardResultTracker A tracker for TensorBoard.
wandb pykeen.trackers.WANDBResultTracker A tracker for Weights and Biases.

Experimentation

Reproduction

PyKEEN includes a set of curated experimental settings for reproducing past landmark experiments. They can be accessed and run like:

pykeen experiments reproduce tucker balazevic2019 fb15k

The three arguments are the model name, the reference, and the dataset. The output directory can be optionally set with -d.

Ablation

PyKEEN includes the ability to specify ablation studies using the hyper-parameter optimization module. They can be run like:

pykeen experiments ablation ~/path/to/config.json

Large-scale Reproducibility and Benchmarking Study

We used PyKEEN to perform a large-scale reproducibility and benchmarking study, which is described in our article:

@article{ali2020benchmarking,
  author={Ali, Mehdi and Berrendorf, Max and Hoyt, Charles Tapley and Vermue, Laurent and Galkin, Mikhail and Sharifzadeh, Sahand and Fischer, Asja and Tresp, Volker and Lehmann, Jens},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  title={Bringing Light Into the Dark: A Large-scale Evaluation of Knowledge Graph Embedding Models under a Unified Framework},
  year={2021},
  pages={1-1},
  doi={10.1109/TPAMI.2021.3124805}
}

We have made all code, experimental configurations, results, and analyses that lead to our interpretations available at https://github.com/pykeen/benchmarking.

Contributing

Contributions, whether filing an issue, making a pull request, or forking, are appreciated. See CONTRIBUTING.md for more information on getting involved.

If you have questions, please use the GitHub discussions feature at https://github.com/pykeen/pykeen/discussions/new.

Acknowledgements

Supporters

This project has been supported by several organizations (in alphabetical order):

Funding

The development of PyKEEN has been funded by the following grants:

Funding Body Program Grant
DARPA Young Faculty Award (PI: Benjamin Gyori) W911NF2010255
DARPA Automating Scientific Knowledge Extraction (ASKE) HR00111990009
German Federal Ministry of Education and Research (BMBF) Maschinelles Lernen mit Wissensgraphen (MLWin) 01IS18050D
German Federal Ministry of Education and Research (BMBF) Munich Center for Machine Learning (MCML) 01IS18036A
Innovation Fund Denmark (Innovationsfonden) Danish Center for Big Data Analytics driven Innovation (DABAI) Grand Solutions

Logo

The PyKEEN logo was designed by Carina Steinborn.

Citation

If you have found PyKEEN useful in your work, please consider citing our article:

@article{ali2021pykeen,
    author = {Ali, Mehdi and Berrendorf, Max and Hoyt, Charles Tapley and Vermue, Laurent and Sharifzadeh, Sahand and Tresp, Volker and Lehmann, Jens},
    journal = {Journal of Machine Learning Research},
    number = {82},
    pages = {1--6},
    title = {{PyKEEN 1.0: A Python Library for Training and Evaluating Knowledge Graph Embeddings}},
    url = {http://jmlr.org/papers/v22/20-825.html},
    volume = {22},
    year = {2021}
}

pykeen's People

Contributors

cthoyt, ddomingof, dobraczka, huenemoerder, jamesmyatt, jas-ho, kantholtz, kdutia, kiddozhu, labrax, lizzalice, lorenzobalzani, luisawerner, lvermue, mali-git, mberr, migalkin, nicolafan, nudin, phaelishall, ralphabb, rodrigo-a-pereira, sbonner0, senjia, sharifza, sunny1401, tatiana-iazykova, tgebhart, tobiasuhmann, vsocrates


pykeen's Issues

Bring Your Own Data Examples have a Typo

Describe the bug

In the Bring Your Own Data section of the documentation, the results of a pipeline run are saved to the variable pipeline_result. Directly afterwards, result.save_to_directory('test_pre_stratified_transe') is called, triggering an error: NameError: name 'result' is not defined.

To Reproduce
Steps to reproduce the behavior:

  1. Copy-pasting the examples from Bring Your Own Data triggers a NameError.

Expected behavior

Rename result to pipeline_result (or vice versa) to prevent the NameError when copy-pasting the example.

Environment (please complete the following information):

N/A. The issue is with the documentation.

Additional information

I looked for open or closed issues mentioning this typo and did not find them. Apologies if this typo is already known.

Implement square error loss

Blocked by #18

The square error loss function computes the squared difference between the predicted scores and the labels $l_i \in \{0, 1\}$:

$L(t_i, l_i) = \frac{1}{2}(f(t_i) - l_i)^2$

The squared error loss strongly penalizes predictions that deviate considerably from the labels, and is usually used for regression problems. For simple models it often permits more efficient optimization algorithms involving analytical solutions of sub-problems, e.g. the Alternating Least Squares algorithm.
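
The formula above can be written out directly. The following is a minimal sketch for a single triple, with a hypothetical function name; it keeps the factor 1/2 from the formula and is not meant to mirror any particular library implementation.

```python
def squared_error_loss(score: float, label: int) -> float:
    """Half the squared difference between a predicted score and a label in {0, 1}."""
    if label not in (0, 1):
        raise ValueError("label must be 0 or 1")
    return 0.5 * (score - label) ** 2


# A score far from its label is penalized quadratically.
near = squared_error_loss(0.5, 0)
far = squared_error_loss(3.0, 1)
```

Doubling the distance between score and label quadruples the loss, which is the strong penalty the description refers to.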

RotatE does not run on gpu

RotatE throws a cpu/gpu tensor mismatch error on the optimizer step when running on gpu

  File "/home/wwymak/code_experiments/pykeen_expts/hello_world_transe.py", line 19, in <module>
    training_kwargs=dict(num_epochs=100)
  File "/home/wwymak/libraries/pykeen/src/pykeen/pipeline.py", line 722, in pipeline
    **training_kwargs,
  File "/home/wwymak/libraries/pykeen/src/pykeen/training/training_loop.py", line 190, in train
    num_workers=num_workers,
  File "/home/wwymak/libraries/pykeen/src/pykeen/training/training_loop.py", line 370, in _train
    self.optimizer.step()
  File "/home/wwymak/anaconda3/envs/deep_graph/lib/python3.7/site-packages/torch/optim/adagrad.py", line 96, in step
    state['sum'].addcmul_(1, grad, grad)
RuntimeError: expected device cpu but got device cuda:0

To Reproduce
running this simple code snippet on a gpu machine

from pykeen.datasets import Kinships
from pykeen.pipeline import pipeline

results = pipeline(
    dataset=Kinships,
    model='RotatE',
    random_seed=1235,
    training_kwargs=dict(num_epochs=100),
)
print(results)

Environment (please complete the following information):

  • OS: ubuntu 18.04
  • Python version: 3.7
  • Version of this software: 1.0.2dev
  • Versions of required Python packages: # run pip list | grep -Eiw 'torch|numpy|Click|click-default-group|tqdm'
    • Torch: 1.4.0
    • Numpy: 1.18.1
    • Click: 7.1.2
    • Click_default_group: 1.2.2
    • Tqdm: 4.46.1

(this error also happens on colab)

Add default HPO ranges to loss classes

Since all classes are now implemented directly in PyKEEN, we can move the default HPO ranges into the classes themselves rather than keeping them in the external dictionary.

question about data format for custom KGs

Thank you for the library. My use case is to embed relations/entities in ConceptNet. ConceptNet is not in RDF form; instead, it is a CSV file, so I need to convert it to a format compatible with PyKEEN.
I checked your source code and found this example. Is this the data format for reading custom KGs?
Thanks in advance.
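
One way to approach such a conversion is sketched below, assuming the target is a three-column, tab-separated file of head, relation, and tail labels, as described in the Bring Your Own Dataset tutorial. The function name, the column names, and the sample row are hypothetical and only serve the illustration.

```python
import csv
import io


def csv_to_tsv_triples(csv_text, head_col, relation_col, tail_col):
    """Keep only the head/relation/tail columns of a CSV and emit tab-separated lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in reader:
        writer.writerow([row[head_col], row[relation_col], row[tail_col]])
    return out.getvalue()


# Hypothetical input with one extra column that is dropped during conversion.
example_csv = "start,rel,end,weight\n/c/en/cat,/r/IsA,/c/en/animal,2.0\n"
tsv = csv_to_tsv_triples(example_csv, head_col="start", relation_col="rel", tail_col="end")
```

The resulting text can be written to a file and passed to a triples factory by path.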

Unexpected Results with Toy Example

I have created a toy example as follows:

Brussels	locatedIn	Belgium
Belgium	partOf	EU
EU	hasCapital	Brussels

As far as I understand, there should be a trivial embedding solution in two dimensions (the three entities form a triangle, with the three relations being their connecting vectors). The code I use for this is as follows:

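from pykeen.pipeline import pipeline

# tf is assumed to be the triples factory built from the three toy triples above
results = pipeline(
    training_triples_factory=tf,
    testing_triples_factory=tf,
    model='TransE',
    model_kwargs=dict(embedding_dim=2),
    random_seed=1,
    device='cpu',
)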

However, the result looks kind of unexpected:
[image: result]

I also played with different random seeds and training parameters, but I never came to a sensible solution. DistMult and HolE, among others, come up with a similar solution.

Am I doing something wrong? Or is my conceptual understanding of the expected outcome incorrect?
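
The triangle solution described in this question can be checked numerically without any training. The sketch below is plain Python and makes two choices of its own: the entities are placed on the unit circle (so the check would also hold if entity embeddings are constrained to unit norm), and a triple is scored by the L2 distance of h + r − t. It only shows that an exact solution exists; it says nothing about whether training converges to it.

```python
import math

# Place the three entities on the unit circle as an equilateral triangle.
angles = {"Brussels": 90, "Belgium": 210, "EU": 330}
entities = {
    name: (math.cos(math.radians(a)), math.sin(math.radians(a)))
    for name, a in angles.items()
}
triples = [
    ("Brussels", "locatedIn", "Belgium"),
    ("Belgium", "partOf", "EU"),
    ("EU", "hasCapital", "Brussels"),
]
# Each relation is the vector pointing from its head to its tail.
relations = {
    r: (entities[t][0] - entities[h][0], entities[t][1] - entities[h][1])
    for h, r, t in triples
}


def transe_distance(h, r, t):
    """L2 norm of h + r - t; a translational model scores a triple by its negative."""
    return math.hypot(h[0] + r[0] - t[0], h[1] + r[1] - t[1])


distances = [transe_distance(entities[h], relations[r], entities[t]) for h, r, t in triples]
# A corrupted triple (wrong tail) is clearly worse than the true one.
corrupted = transe_distance(entities["Brussels"], relations["locatedIn"], entities["EU"])
```

All three true triples have (numerically) zero distance, while the corrupted triple lies a full triangle edge away.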

Implement Pointwise Hinge Loss

Blocked by #18

The pointwise hinge loss sets the score of positive examples larger than a margin parameter $\lambda$ while reducing the scores of negative examples to values below $-\lambda$:

$L(t_i, l_i) = \max(0, \lambda - \hat{l}_i \cdot f(t_i))$

where $\hat{l}_i \in \{-1, 1\}$. The loss penalizes scores of positive examples which are smaller than $\lambda$, but does not impose any restriction on values $> \lambda$. Similarly, negative scores larger than $-\lambda$ contribute to the loss, whereas all values smaller than $-\lambda$ do not have any loss contribution. Thereby, the model is not encouraged to further optimize triples which are already predicted well enough (according to the margin parameter $\lambda$).
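
The behavior described above can be checked with a small sketch for a single triple. The function name is hypothetical, and the code simply writes out the formula with labels in {−1, 1}.

```python
def pointwise_hinge_loss(score: float, label: int, margin: float = 1.0) -> float:
    """max(0, margin - label * score) with label in {-1, 1}."""
    if label not in (-1, 1):
        raise ValueError("label must be -1 or 1")
    return max(0.0, margin - label * score)


# Positives above the margin and negatives below the negated margin cost nothing.
satisfied_positive = pointwise_hinge_loss(2.0, 1)
satisfied_negative = pointwise_hinge_loss(-2.0, -1)
# Scores inside the margin band are penalized linearly.
violating_positive = pointwise_hinge_loss(0.5, 1)
violating_negative = pointwise_hinge_loss(0.5, -1)
```

Once a score has cleared its margin, pushing it further changes nothing, which is why well-predicted triples stop contributing to the loss.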

Fix usage of nonzero for PyTorch 1.7+

Is your feature request related to a problem? Please describe.
The usage of Tensor.nonzero() (without arguments) has been deprecated, cf. pytorch/pytorch#40187

At least in

filter_batch = (entity_filter_test & relation_filter).nonzero()

we use the now deprecated version.

Describe the solution you'd like
Change all usages of nonzero to use the future-proof variant, i.e. in the above-mentioned case

filter_batch = (entity_filter_test & relation_filter).nonzero(as_tuple=False) 

Describe alternatives you've considered
Keep it as it is. This might lead to problems once PyTorch removes the deprecated behaviour.

Additional context
Warning found in this notebook: https://gist.github.com/cthoyt/190233fd98a11306ceb13f2ee0e95a9e (scroll all the way to the bottom)

New Setwise Loss(es)

Blocked by #18

Cross entropy loss and NSSA loss are the only two setwise loss formulations described in our paper. It's also a bit confusing because technically, in the implementation, NSSA loss inherits from PairwiseLoss. Are there other setwise loss functions we could add, so there's more than one?

Add example of loading a model from a pickle to "Novel Link Prediction" tutorial

Is your feature request related to a problem? Please describe.
I am trying to reproduce one of the landmark experiments. When the experiment runs, the outputs are config files and a pkl file, along with files that contain MRR and Hits@1,3,10.

Describe the solution you'd like
I would like to see the predictions, not just the aggregate numerical results (like predictions for all triples). It is not clear how to do that from this point (especially after spending a lot of time training the model, then not being able to access the predictions).

Describe alternatives you've considered
Would it be possible to load from the pkl file and make predictions? I tried this but it did not work out. Please provide an example so this becomes more clear.


Add ConceptNet

Issue #2 brought attention to ConceptNet as a possible dataset to include with PyKEEN. They provide a tab-separated dump of the database here. It is not pre-stratified into training/testing/evaluation sets.

Because this file has additional columns besides head, relation, and tail, its inclusion will also require an update to the SingleTabbedDataset such that the usecols keyword argument can be specified in the dataset's __init__().

Blocked by #196 because the splitting algorithm is currently too slow for big datasets (with more than ~5 million triples).

Add Fractional Hits@K Documentation

Since #17 was merged before I got a chance to make a fuss about there not being documentation, it needs to be revisited.

  1. Since you touched it last, you're responsible for writing the documentation for the RankBasedEvaluator.__init__() which should include explanations of what all parameters do (ks and filtered) as well as an explanation of integer hits@k versus fractional hits@k. References required.
  2. What happens when somebody uses a fractional hits@k downstream? right now all code interprets whatever happens after the @ as an integer. This will break all usages of pipeline() and hpo_pipeline. Please provide unittests showing fractional hits@k in use
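To make the second point concrete, here is a rough sketch of how a metric name could be parsed without assuming an integer after the @. The helper names parse_hits_at_k and resolve_cutoff are hypothetical, and reading a fractional k as a share of the candidate entities is an assumption that the requested documentation would have to confirm.

```python
import math
from typing import Union


def parse_hits_at_k(metric: str) -> Union[int, float]:
    """Parse the k in a metric name such as 'hits@10' or 'hits@0.1'."""
    prefix, sep, raw_k = metric.partition("@")
    if prefix != "hits" or not sep or not raw_k:
        raise ValueError(f"not a hits@k metric: {metric!r}")
    try:
        k_int = int(raw_k)  # integer k: an absolute rank cutoff
    except ValueError:
        k_float = float(raw_k)  # fractional k (raises ValueError if not a number)
        if not 0.0 < k_float < 1.0:
            raise ValueError(f"fractional k must lie in (0, 1): {metric!r}")
        return k_float
    if k_int < 1:
        raise ValueError(f"integer k must be at least 1: {metric!r}")
    return k_int


def resolve_cutoff(k: Union[int, float], num_candidates: int) -> int:
    """Turn k into an absolute rank cutoff (assumption: a float is a share of candidates)."""
    if isinstance(k, float):
        return max(1, math.ceil(k * num_candidates))
    return k


print(parse_hits_at_k("hits@10"), parse_hits_at_k("hits@0.1"))
```

Unit tests for downstream code could then cover both branches, for example a configuration that optimizes "hits@0.1" next to one that optimizes "hits@10".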

Could you publish some experimental results ?

I got the result of TransE on WN18RR dataset as follows, with 100 epoch and default parameter settings.

mean_rank={'best': 7396.350889192887, 'worst': 7396.3558481532145, 'avg': 7396.353368673051}, mean_reciprocal_rank={'best': 0.11426725607718224, 'worst': 0.11426725590337292, 'avg': 0.1142672559902559}, hits_at_k={'best': {1: 0.0032489740082079343, 3: 0.19493844049247605, 5: 0.24811901504787962, 10: 0.2973666210670315}, 'worst': {1: 0.0032489740082079343, 3: 0.19493844049247605, 5: 0.24811901504787962, 10: 0.2973666210670315}, 'avg': {1: 0.0032489740082079343, 3: 0.19493844049247605, 5: 0.24811901504787962, 10: 0.2973666210670315}}, adjusted_mean_rank=0.36492311337297334.

This seems a bit lower than the result reported in the paper.

RGCN - ValueError: The number of bases should not exceed the number of relations. (explanation)

I am trying to use the RGCN algorithm and run into the following error.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-231451742934> in <module>
      5 training, testing = tf.split()
      6 
----> 7 pipeline_result = pipeline(
      8     training_triples_factory=training,
      9     testing_triples_factory=testing,

~\anaconda3\envs\pykeen\lib\site-packages\pykeen\pipeline.py in pipeline(dataset, dataset_kwargs, training_triples_factory, testing_triples_factory, validation_triples_factory, evaluation_entity_whitelist, evaluation_relation_whitelist, model, model_kwargs, loss, loss_kwargs, regularizer, regularizer_kwargs, optimizer, optimizer_kwargs, clear_optimizer, training_loop, negative_sampler, negative_sampler_kwargs, training_kwargs, stopper, stopper_kwargs, evaluator, evaluator_kwargs, evaluation_kwargs, result_tracker, result_tracker_kwargs, metadata, device, random_seed, use_testing_data)
    711 
    712     model = get_model_cls(model)
--> 713     model_instance: Model = model(
    714         triples_factory=training_triples_factory,
    715         **model_kwargs,

~\anaconda3\envs\pykeen\lib\site-packages\pykeen\models\unimodal\rgcn.py in __init__(self, triples_factory, embedding_dim, automatic_memory_optimization, loss, predict_with_sigmoid, preferred_device, random_seed, num_bases_or_blocks, num_layers, use_bias, use_batch_norm, activation_cls, activation_kwargs, base_model, sparse_messages_slcwa, edge_dropout, self_loop_dropout, edge_weighting, decomposition, buffer_messages)
    231                 num_bases_or_blocks = triples_factory.num_relations // 2 + 1
    232             if num_bases_or_blocks > triples_factory.num_relations:
--> 233                 raise ValueError('The number of bases should not exceed the number of relations.')
    234         elif self.decomposition == 'block':
    235             if num_bases_or_blocks is None:

ValueError: The number of bases should not exceed the number of relations.

I assumed that bases are the same as nodes. When changing the number of relation types to be more than the different number of nodes It seemed to work.

However I want to apply the RGCN approach to a graph with 30 000 different nodes and only 6 different relation types between the nodes. From what I understand about the mathematics behind the RGCN model this should be possible, why is it recommended then in the error to have so many different relation types?
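For reference, the check in the traceback compares the number of bases with the number of relations, not with the number of nodes. The sketch below mirrors lines 231-233 of the traceback; the function name resolve_num_bases is hypothetical, and the "is None" condition guarding the default is assumed from context because the traceback does not show it.

```python
from typing import Optional


def resolve_num_bases(num_relations: int, num_bases: Optional[int] = None) -> int:
    """Mirror the basis-decomposition check shown in the traceback."""
    if num_bases is None:
        # Default visible in the traceback: half the relations, plus one.
        num_bases = num_relations // 2 + 1
    if num_bases > num_relations:
        raise ValueError("The number of bases should not exceed the number of relations.")
    return num_bases


# A graph with 6 relation types passes the check regardless of its node count.
print(resolve_num_bases(num_relations=6))
```

Under this reading, a graph with 30 000 nodes and 6 relation types is fine as long as the number of bases is at most 6.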

I am also sorry if this is not the platform to ask questions like this, but it seems to be the only platform where I consistently get answers about this package, presumably due to how new it still is.

Data format description

I am having difficulty understanding which format my KG data should be in to be used with PyKEEN. What structure should it have?

Predict tails results in KeyError

Describe the bug
After training a model on fb15k237 dataset, I tried running a test with predict_tails.

When using this function, I had a KeyError resulting from calling predict_tails, this is because the predict function calls this line head_id = self.triples_factory.entity_to_id[head_label] which is fetching the dictionary entry for the entity, id pair.

This happened for /m/05hyf in the test set of the fb15k237 dataset, after training and just calling the predict function. Should a KeyError happen if the key is not found?
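On the last question: a plain dictionary lookup such as entity_to_id[head_label] always raises a KeyError for a label that is not in the mapping. A small sketch of checking labels up front is shown below; the helper name missing_labels and the example mapping are hypothetical and stand in for the triples factory's entity_to_id dictionary.

```python
from typing import Dict, Iterable, List


def missing_labels(entity_to_id: Dict[str, int], labels: Iterable[str]) -> List[str]:
    """Return the labels that have no ID in the training mapping."""
    return [label for label in labels if label not in entity_to_id]


# Hypothetical stand-in for the label-to-ID mapping built from the training triples.
entity_to_id = {"entity_a": 0, "entity_b": 1}

unknown = missing_labels(entity_to_id, ["/m/05hyf", "entity_a"])
print(unknown)
```

Labels reported here were never seen during training, so there is no embedding to predict with for them.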

To Reproduce
Steps to reproduce the behavior:

  1. Run this code to train the model
from pykeen.pipeline import pipeline
from pykeen.triples.triples_factory import TriplesFactory


def main():
    result = pipeline(
        dataset="fb15k237",
        model="DistMult",
        model_kwargs=dict(embedding_dim=100),
        regularizer="no",
        optimizer="Adam",
        optimizer_kwargs=dict(lr=0.001),
        negative_sampler="basic",
        negative_sampler_kwargs=dict(num_negs_per_pos=500),
        training_kwargs=dict(
            num_epochs=1,  # just to be fast
            batch_size=256,
        ),
    )
    model = result.model
    # test.txt is the file from https://github.com/ZhenfengLei/KGDatasets/tree/master/FB15k-237
    tf = TriplesFactory(path="test.txt")
    triples = tf.triples
    model.predict_tails("/m/05hyf", "/film/film_subject/films")  # KeyError here


if __name__ == "__main__":
    main()
  2. KeyError: '/m/05hyf'

Expected behavior
Runs and results in a dataframe of predictions.

Environment (please complete the following information):

  • OS: Linux
  • Python version: 3.6.5
  • Version of this software: 1.0.4
  • Versions of required Python packages: # run pip list | grep -Eiw 'torch|numpy|Click|click-default-group|tqdm'
    • Torch: 1.6.0
    • Numpy: 1.19.2
    • Click: 7.1.2
    • Click_default_group: 1.2.2
    • Tqdm: 4.49.0

Add 5* Model

On which paper(s) is your requested model based.

https://arxiv.org/abs/2006.04986

List the original and other existing implementations of this model

No link provided in the paper

Relevance of this model

This model is a logical successor of RotatE, ComplEx, etc. @mali-git can also probably ask the authors to help implement it in PyKEEN since they're in his group.

Ability to help and additional context

Improve the doc

Currently, it is difficult to understand the data structure (i.e. what I am passing to the model).
It may be useful to add a description page to help in developing custom models and datasets and therefore, extend the framework.

Furthermore, I think that an example based on Graph Neural Networks (maybe with PyTorch Geometric) will be very appreciated.

Support only BCEAfterSigmoidLoss and BCEWithLogits but not BCELoss

Currently, we support BCE and BCEAfterSigmoidLoss. However, training will fail when training a model with BCE because we do not provide the option predict_with_sigmoid in the score_hrt/h/r/t() functions (we only support it in the predict_scores_all_heads/tails/relations() functions).

We could remove the support for blank BCE, or we need to add the option predict_with_sigmoid in the scoring functions.

Add PyKEEN 1.0 History

We need to merge from the private development repository. It will take an awful lot of dark magic to get there:

  1. Make a new branch on pykeen/pykeen (optional - delete everything, and commit?)
  2. Add the following to the .git/config file
[remote "poem"]
    url = [email protected]:mali-git/POEM_develop.git
    fetch = +refs/heads/*:refs/remotes/origin/*
  3. git pull poem master --allow-unrelated-histories --no-commit -S and pray to rebased god. Figured this out with lots of googling and eventually the perfect article, so thanks a bunch to the author @fdv
  4. Commit
  5. Profit

Index out of range in PyTorch embedding using own data

Describe the bug
I am trying to use pykeen on my own data and following the BYOD section of the docs I am doing the following:

from pykeen.triples import TriplesFactory
from pykeen.pipeline import pipeline

tf = TriplesFactory(path="/tmp/exampledata.txt")
training, testing = tf.split()

pipeline_result = pipeline(
    training_triples_factory=training, testing_triples_factory=testing, model="TransH"
)

Unfortunately I get the following:

Using random_state=1333753659 to split TriplesFactory(path="/tmp/out")
No random seed is specified. Setting to 2098687070.
No cuda devices were available. The model runs on CPU
Training epochs on cpu:   0%|                                                           | 0/5 [00:00<?, ?epoch/s]INFO:pykeen.training.training_loop:using stopper: <pykeen.stoppers.stopper.NopStopper object at 0x7f42e4842cc0>
Training epochs on cpu:   0%|                                                           | 0/5 [00:00<?, ?epoch/s]
Traceback (most recent call last):                                                                               
  File "test.py", line 8, in <module>
    training_triples_factory=training, testing_triples_factory=testing, model="TransH"
  File "/home/dobraczka/.local/share/virtualenvs/embedding-transformers-I3i1Obsv/lib/python3.7/site-packages/pykeen/pipeline.py", line 815, in pipeline
    **training_kwargs,
  File "/home/dobraczka/.local/share/virtualenvs/embedding-transformers-I3i1Obsv/lib/python3.7/site-packages/pykeen/training/training_loop.py", line 190, in train
    num_workers=num_workers,
  File "/home/dobraczka/.local/share/virtualenvs/embedding-transformers-I3i1Obsv/lib/python3.7/site-packages/pykeen/training/training_loop.py", line 376, in _train
    slice_size,
  File "/home/dobraczka/.local/share/virtualenvs/embedding-transformers-I3i1Obsv/lib/python3.7/site-packages/pykeen/training/training_loop.py", line 438, in _forward_pass
    slice_size=slice_size,
  File "/home/dobraczka/.local/share/virtualenvs/embedding-transformers-I3i1Obsv/lib/python3.7/site-packages/pykeen/training/slcwa.py", line 101, in _process_batch
    positive_scores = self.model.score_hrt(positive_batch)
  File "/home/dobraczka/.local/share/virtualenvs/embedding-transformers-I3i1Obsv/lib/python3.7/site-packages/pykeen/models/unimodal/trans_h.py", line 134, in score_hrt
    d_r = self.relation_embeddings(hrt_batch[:, 1])
  File "/home/dobraczka/.local/share/virtualenvs/embedding-transformers-I3i1Obsv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/dobraczka/.local/share/virtualenvs/embedding-transformers-I3i1Obsv/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 114, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/home/dobraczka/.local/share/virtualenvs/embedding-transformers-I3i1Obsv/lib/python3.7/site-packages/torch/nn/functional.py", line 1724, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

Loading e.g. the kinships dataset this way (i.e. concatenating train.txt, test.txt, valid.txt into a single file to have the same setting) I do not get this error. I have attached the data, which is a simple three-column tsv file.
exampledata.txt

To Reproduce
Steps to reproduce the behavior:

  1. Execute the above script
  2. See error

Expected behavior
Pipeline runs without errors and returns result.

Environment (please complete the following information):

  • OS: Debian Buster
  • Python version: 3.7
  • Version of this software: 1.0.2
  • Versions of required Python packages:
    click 7.1.2
    click-default-group 1.2.2
    numpy 1.19.0
    torch 1.5.1
    tqdm 4.48.0

Save to AWS S3

Enable saving of models to AWS S3 using boto3. Will look very similar to the results from #28.

Add novel link prediction pipeline

Given an entity, its position (either head or tail), and a relation, return a ranked list of all entities and their scores. It should be clear if higher score or lower score means more likely to be a real edge. We might want to have it automatically filter out entities corresponding to edges already in the original knowledge graph.
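The requested behaviour can be sketched independently of any model, assuming the scores for all candidate tails are already available as a dictionary. The function name rank_tails, its parameters, and the example labels are hypothetical; whether a higher score is better depends on the model, which is why it is an explicit argument here.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]


def rank_tails(
    head: str,
    relation: str,
    scores: Dict[str, float],
    known_triples: Set[Triple],
    higher_is_better: bool = True,
    remove_known: bool = True,
) -> List[Tuple[str, float]]:
    """Rank candidate tails for (head, relation), optionally dropping known edges."""
    candidates = [
        (tail, score)
        for tail, score in scores.items()
        if not (remove_known and (head, relation, tail) in known_triples)
    ]
    # Most plausible candidate first, under the stated score direction.
    return sorted(candidates, key=lambda pair: pair[1], reverse=higher_is_better)


# Hypothetical scores for every candidate tail of ("brazil", "exports_to", ?).
scores = {"uk": 0.9, "usa": 0.7, "china": 0.1}
known = {("brazil", "exports_to", "usa")}
ranked = rank_tails("brazil", "exports_to", scores, known)
print(ranked)
```

The same function covers head prediction if the roles of head and tail are swapped when building the candidate triples.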

This should be implemented in the top-level Model class, as all models should have this functionality.

Question: do models trained under LCWA or sLCWA have to behave differently here? Does this functionality have to behave differently based on the loss function?

Implement Pairwise Logistic Loss

Blocked by #18

The pairwise logistic loss is defined as:

L(∆) = log(1 + exp(∆))

Thus, it can be seen as a soft-margin formulation of the pairwise hinge loss (MRL) with a margin of zero.
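A small numerical sketch of this relationship, assuming ∆ is the same quantity that the hinge loss penalizes, so that MRL is max(0, margin + ∆). The function names are hypothetical, and the logistic loss is written in a numerically stable form that is algebraically equal to log(1 + exp(∆)).

```python
import math


def pairwise_logistic_loss(delta: float) -> float:
    """log(1 + exp(delta)), rewritten so exp() never overflows."""
    return max(delta, 0.0) + math.log1p(math.exp(-abs(delta)))


def pairwise_hinge_loss(delta: float, margin: float = 0.0) -> float:
    """max(0, margin + delta), the margin ranking loss for a single pair."""
    return max(0.0, margin + delta)


for delta in (-5.0, -1.0, 0.0, 1.0, 5.0):
    soft = pairwise_logistic_loss(delta)
    hard = pairwise_hinge_loss(delta)
    print(f"delta={delta:+.1f}  logistic={soft:.4f}  hinge={hard:.4f}")
```

The logistic loss is a smooth upper bound on the zero-margin hinge loss: the two approach each other for large |∆|, and the gap is largest at ∆ = 0, where it equals log(2).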

Update models using complex numbers

As of the PyTorch 1.6 release on July 28th, 2020, there is a native type for tensors of complex numbers (see: https://pytorch.org/docs/stable/complex_numbers.html)

From @mberr: Referencing the 1.7.1 release tracker here, in case they mention something about it: pytorch/pytorch#47622

It would be interesting to test updated implementations of models using complex tensors such as RotatE and ComplEx to see if we can make them more elegant using this new trick. #292 is a good solution, but still not native. #134 presents a solution where the tensors themselves inside the Embedding are assigned the complex dtype like in:

...
# initialize weight outside of torch.nn.Embedding to sneak in the dtype definition
_weight = torch.empty((num_embeddings, embedding_dim), dtype=dtype)

self._embeddings = torch.nn.Embedding(
    num_embeddings=num_embeddings,
    embedding_dim=embedding_dim,
    _weight=_weight,
)
...

Then, the math in ComplEx can be updated like in:

# old
(h_re, h_im), (r_re, r_im), (t_re, t_im) = [split_complex(x=x) for x in (h, r, t)]

# new
h_re, h_im = h.real, h.imag
r_re, r_im = r.real, r.imag
t_re, t_im = t.real, t.imag
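To illustrate that the two formulations agree, here is a sketch using Python's built-in complex type instead of torch tensors, so it says nothing about autograd support. It assumes the usual ComplEx score Re(<h, r, conj(t)>); the function names and example vectors are hypothetical.

```python
from typing import Sequence


def score_native(h: Sequence[complex], r: Sequence[complex], t: Sequence[complex]) -> float:
    """ComplEx score using native complex arithmetic."""
    return sum((hi * ri * ti.conjugate()).real for hi, ri, ti in zip(h, r, t))


def score_split(h: Sequence[complex], r: Sequence[complex], t: Sequence[complex]) -> float:
    """The same score written out over separate real and imaginary parts."""
    total = 0.0
    for hi, ri, ti in zip(h, r, t):
        h_re, h_im = hi.real, hi.imag
        r_re, r_im = ri.real, ri.imag
        t_re, t_im = ti.real, ti.imag
        total += (
            h_re * r_re * t_re
            + h_re * r_im * t_im
            + h_im * r_re * t_im
            - h_im * r_im * t_re
        )
    return total


h = [1 + 2j, 0.5 - 1j]
r = [0.5 + 0.5j, 2j]
t = [1 - 1j, -1 + 0.5j]
print(score_native(h, r, t), score_split(h, r, t))
```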

However, this update is blocked by a major issue - the torch.nn.functional.embedding function does not currently support automatic differentiation on complex tensors.

Traceback (most recent call last):
  File "/usr/local/opt/[email protected]/Frameworks/Python.framework/Versions/3.8/lib/python3.8/unittest/case.py", line 60, in testPartExecutor
    yield
  File "/usr/local/opt/[email protected]/Frameworks/Python.framework/Versions/3.8/lib/python3.8/unittest/case.py", line 676, in run
    self._callTestMethod(testMethod)
  File "/usr/local/opt/[email protected]/Frameworks/Python.framework/Versions/3.8/lib/python3.8/unittest/case.py", line 633, in _callTestMethod
    method()
  File "/Users/cthoyt/dev/pykeen/tests/test_models.py", line 439, in test_score_r_with_score_hrt_equality
    raise e
  File "/Users/cthoyt/dev/pykeen/tests/test_models.py", line 431, in test_score_r_with_score_hrt_equality
    scores_r = self.model.score_r(batch)
  File "/Users/cthoyt/dev/pykeen/src/pykeen/models/base.py", line 993, in score_r
    expanded_scores = self.score_hrt(hrt_batch=hrt_batch)
  File "/Users/cthoyt/dev/pykeen/src/pykeen/models/unimodal/complex.py", line 151, in score_hrt
    h = self.entity_embeddings(indices=hrt_batch[:, 0])
  File "/Users/cthoyt/.virtualenvs/pykeen/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/Users/cthoyt/dev/pykeen/src/pykeen/nn/emb.py", line 177, in forward
    x = self._embeddings(indices)
  File "/Users/cthoyt/.virtualenvs/pykeen/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/Users/cthoyt/.virtualenvs/pykeen/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 124, in forward
    return F.embedding(
  File "/Users/cthoyt/.virtualenvs/pykeen/lib/python3.8/site-packages/torch/nn/functional.py", line 1852, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: embedding does not support automatic differentiation for outputs with complex dtype.

What is the estimated time of training/running on multiple GPUs?

Is your feature request related to a problem? Please describe.
I have trained some of the models so far by using the parameters in the experiments config files and some of them have taken an hour or less while others timed out after running on our university's supercomputer for over 23 hours. It's a little frustrating because I can't tell if something is wrong and I lose all the progress after training for a whole day.

Describe the solution you'd like
It would be great if we can have some estimates of how long the models would take so we know if something is wrong? Or maybe some hardware requirements to run them for that given time? I am running on one GPU and I can't tell if this timing is normal.

Describe alternatives you've considered

  • I have made changes in the parameters, I have tried running pykeen experiments command and also just taking the config file and using the parameters in the code by creating a pipeline and the experiments continue to take a long time.
  • Other than taking a long time, when I run experiments on another server with 10 GeForce RTX GPUs with 11019MiB memory each, the code only runs on 1 GPU (changing CUDA_VISIBLE_DEVICES does not change this). Even when I try to load the pickle file and use it to make predictions, I get an out of memory error. Are there instructions on running this on multiple GPUs so that I don't get this error?

Additional context
Out of memory error screenshot here, triggered by running this code:
import torch
map_location=torch.device('cpu')
model = torch.load('trained_model.pkl') #distmult wn18
predictions_df = model.score_all_triples()

And also by running any predict all_triples experiment on the server I mentioned.

New hyperbolic models: RotH, RefH, AttH

On which paper(s) is your requested model based.

Low-Dimensional Hyperbolic Knowledge Graph Embeddings
Chami et al. (2021)
https://arxiv.org/abs/2005.00545

List the original and other existing implementations of this model
Official pytorch implementation: https://github.com/HazyResearch/KGEmb
Tensorflow implementation: https://github.com/tensorflow/neural-structured-learning/tree/master/research/kg_hyp_emb

Relevance of this model

  1. Hyperbolic things blow my mind
  2. More variety
  3. Performing pretty well

Ability to help and additional context

Report marginal evaluation metrics

In the opposite scenario to #62/#83, we want to report several different sets of metrics based on each relation type. For example, for hetionet, you might want to specify edges or groups of edges to evaluate together:

from pykeen.pipeline import pipeline

pipeline_results = pipeline(
    dataset='Hetionet',
    model='RotatE',
    evaluation_relation_whitelist=[
        {'CtD', 'CpD'},  # evaluate the compound treats disease and compound palliates disease relations together
        'CcSE',  # evaluate the compound causes side effect alone
        {'CdG', 'CuG', 'CbG'},  # compound/gene regulations/bindings
        {'GiG', 'GcG',  'Gr>G'},  # gene/gene relations,
        None,  # special entry None should group all remaining that haven't appeared in the rest
    ],
)
  • It should theoretically be possible to have duplicates appearing throughout the subsets.
  • None should be a special entry that groups everything you haven't considered. Maybe raise error if all have been covered and None is also added?
  • all (or something else that can be written in a JSON configuration) should be a special entry that will do everything. Alternatively, "all" could always be reported

Alternate idea, using a dict for named subsets

from pykeen.pipeline import pipeline

pipeline_results = pipeline(
    dataset='Hetionet',
    model='RotatE',
    evaluation_relation_whitelist={
        'Compound-Disease': {'CtD', 'CpD'},
        'Compound-SideEffect': 'CcSE',
        'Compound-Gene': {'CdG', 'CuG', 'CbG'},
        'Gene-Gene': {'GiG', 'GcG',  'Gr>G'},
        'Remainder': None,
    },
)

It might be better overall to force usage of dictionaries so we can switch on the type (mapping vs. collection)
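A sketch of what switching on the type could look like, so that both the list form and the dictionary form end up as named groups. The function name normalize_relation_groups, the generated group names, and the extra relation in the example are hypothetical; raising when None is given but nothing remains follows the suggestion in the list above.

```python
from collections.abc import Mapping
from typing import Dict, FrozenSet, Set


def normalize_relation_groups(spec, all_relations: Set[str]) -> Dict[str, FrozenSet[str]]:
    """Turn a list or dict specification into named groups of relations."""
    if isinstance(spec, Mapping):
        named = dict(spec)
    else:
        # Unnamed collections get generated names (hypothetical convention).
        named = {f"group_{i}": members for i, members in enumerate(spec)}

    groups: Dict[str, FrozenSet[str]] = {}
    covered: Set[str] = set()
    remainder_names = []
    for name, members in named.items():
        if members is None:
            remainder_names.append(name)  # resolved once all other groups are known
            continue
        if isinstance(members, str):
            members = {members}  # a single relation given as a bare string
        groups[name] = frozenset(members)
        covered |= groups[name]

    remainder = frozenset(all_relations - covered)
    for name in remainder_names:
        if not remainder:
            raise ValueError("None was given, but every relation is already covered")
        groups[name] = remainder
    return groups


all_relations = {"CtD", "CpD", "CcSE", "CdG", "CuG", "CbG", "GiG", "GcG", "Gr>G", "DaG"}
groups = normalize_relation_groups(
    {
        "Compound-Disease": {"CtD", "CpD"},
        "Compound-SideEffect": "CcSE",
        "Compound-Gene": {"CdG", "CuG", "CbG"},
        "Gene-Gene": {"GiG", "GcG", "Gr>G"},
        "Remainder": None,
    },
    all_relations,
)
print(sorted(groups["Remainder"]))
```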

I just read through the pipeline and it seems like this is going to be quite a pain in the butt to do this...

Details on how to reproduce experiments

Can you help me understand how to use your framework properly? I tried following the documentation on how to train and evaluate embedding models in PyKEEN, but the OpenBioLink dataset caused CUDA out-of-memory issues (partly because slicing is not implemented for most models; or can I adapt any parameters to make it fit?). As a side note, how are OBLF1 and F2 supposed to be used? I also could not get the neural network implementations I was particularly interested in to run. I then tested a dataset-model combination that was actually used in your experiments, and the results still seem far off anything you report: around hits@3: 0.0005 and hits@10: 0.0007 for the attached configuration. I am aware that I did not run the same hyperparameter search; unlike your setup, where 24 hours get you through a good number of parameters, that was not feasible for a simple tryout with a training time of 107170 sec. I have the 1.0.1-dev version installed and use GeForce GTX 1080 AERO GPUs (2560 CUDA cores, 11 GB GDDR5).

from pykeen.pipeline import pipeline

pipeline_result = pipeline(
    dataset='WN18RR',
    model='RotatE',
    model_kwargs=dict(
        embedding_dim=500,
        #automatic_memory_optimization=True,
        ),
    training_kwargs=dict(num_epochs=1000),
    optimizer='Adam',
    optimizer_kwargs=dict(
        lr=0.00008,
        weight_decay=0.91,
    ),
    training_loop='sLCWA',
    loss='MarginRankingLoss',
    evaluator='RankBasedEvaluator',
    evaluator_kwargs=dict(filtered=True,),
    negative_sampler="basic",
    negative_sampler_kwargs=dict(num_negs_per_pos=1024)
)
print(pipeline_result)

New split algorithm for constrained evaluation

What about adding a mode that moves all entities that won't get used in evaluation over to training?

I suppose you mean the triples containing entities which are not in evaluation_entity_whitelist? We could think about this, but it would change the split ratio of the subsets' number of triples.

Maybe we add this functionality in our splitting functionalities (and emit appropriate warnings)? Or we look at it from a different perspective: we first split all triples of interest by split_ratio. Then we extend the training set by additional triples. Practically we need to make sure that the ID-mapping stays consistent.
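The second perspective can be sketched on label triples, which sidesteps the ID-mapping question entirely. The function name constrained_split and the example data are hypothetical, and treating a triple as "of interest" only when both its head and tail are whitelisted is an assumption.

```python
import random
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]


def constrained_split(
    triples: List[Triple],
    entity_whitelist: Set[str],
    ratio: float = 0.8,
    seed: int = 0,
) -> Tuple[List[Triple], List[Triple]]:
    """Split the triples of interest by ratio, then add all others to training."""
    of_interest = [t for t in triples if t[0] in entity_whitelist and t[2] in entity_whitelist]
    rest = [t for t in triples if not (t[0] in entity_whitelist and t[2] in entity_whitelist)]

    rng = random.Random(seed)
    rng.shuffle(of_interest)
    cut = int(round(ratio * len(of_interest)))

    training = of_interest[:cut] + rest  # extend training with the remaining triples
    testing = of_interest[cut:]
    return training, testing


triples = [
    ("a", "r", "b"), ("b", "r", "c"), ("c", "r", "a"), ("a", "r", "c"), ("b", "r", "a"),
    ("x", "r", "a"), ("y", "r", "x"),
]
training, testing = constrained_split(triples, entity_whitelist={"a", "b", "c"})
print(len(training), len(testing))
```

As noted above, the ratio then only holds among the triples of interest, so the overall training share is larger than the requested ratio.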

Originally posted by @mberr in #62 (comment)

device mismatch between triples and model when model is in inference mode

Describe the bug
For a model trained on GPU, when I do model.predict_tails(...) it gives the following error:

....
/usr/local/lib/python3.6/dist-packages/pykeen/models/base.py in _novel(self, h, r, t)
    399         """Return if the triple is novel with respect to the training triples."""
    400         triple = torch.tensor(data=[h, r, t], dtype=torch.long, device=self.device).view(1, 3)
--> 401         return (triple == self.triples_factory.mapped_triples).all(dim=1).any().item()
    402 
    403     def predict_scores_all_relations(

/usr/local/lib/python3.6/dist-packages/torch/tensor.py in wrapped(*args, **kwargs)
     26     def wrapped(*args, **kwargs):
     27         try:
---> 28             return f(*args, **kwargs)
     29         except TypeError:
     30             return NotImplemented

RuntimeError: expected device cuda:0 but got device cpu

(not sure if the model should always be on the cpu? but if so would be nice to update the docs to point this out, since if I trained the model on gpu I should expect to be able to do inference on the gpu as well?)

To Reproduce
Steps to reproduce the behavior:

  1. train a model on gpu
  2. do model.predict_tails('entity', 'relation')
  3. device mismatch error

Expected behavior
there shouldn't be a device mismatch

Environment (please complete the following information):

  • OS: ubuntu 18.04
  • Python version: 3.7
  • Version of this software: 1.0.2dev
  • Versions of required Python packages: # run pip list | grep -Eiw 'torch|numpy|Click|click-default-group|tqdm'
    • Torch: 1.4.0
    • Numpy: 1.18.1
    • Click: 7.1.2
    • Click_default_group: 1.2.2
    • Tqdm: 4.46.1

Additional information
an example on colab here

Improve interpretability of interaction function

Since each interaction model has a different range (e.g., the values of some interaction functions could be between [-inf, inf], some could be [0, inf], some could be [0, 1]), interpreting the scores is difficult. Even worse, this also depends on the interaction models' constraints on embeddings, etc.

It would be good for all interaction functions to define an "interpretation function" that remaps the results onto the range of [0,1]

From @mberr:

Obtaining such is an active research area and neither straight-forward, nor exists a "unique" solution. In case you are interested, you may want to look for keywords "model calibration" (e.g. Platt scaling), and for KG "triple classification".

R-GCN raises Optuna error

If I select the RGCN model, the trials return None:
Optuna init() got an unexpected keyword argument 'base_model_cls' ...
Choices for a categorical distribution should be a tuple of None, bool, int, float and str for persistent storage but contains <function symmetric_edge_weights at 0x2aac1328bb90> which is of type function.
Is there something I didn't realize? With other models it works fine.

Issue with random number generation in TriplesFactory.split

When trying to use my own data I get the following error. I am not very experienced and don't know how to solve this or why this is happening.

ValueError                                Traceback (most recent call last)
<ipython-input-13-ef15ccdc9011> in <module>
      3 
      4 tf = TriplesFactory(path=work_path + '/eucalyptus_triplets.txt')
----> 5 training, testing = tf.split()
      6 
      7 pipeline_result = pipeline(

~\anaconda3\envs\pykeen\lib\site-packages\pykeen\triples\triples_factory.py in split(self, ratios, random_state, randomize_cleanup)
    419         idx = np.arange(n_triples)
    420         if random_state is None:
--> 421             random_state = np.random.randint(0, 2 ** 32 - 1)
    422             logger.warning(f'Using random_state={random_state} to split {self}')
    423         if isinstance(random_state, int):

mtrand.pyx in numpy.random.mtrand.RandomState.randint()

_bounded_integers.pyx in numpy.random._bounded_integers._rand_int32()

ValueError: high is out of bounds for int32
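The traceback ends in _rand_int32 and the paths point to a Windows environment. A likely explanation is that NumPy's default integer type was 32-bit on Windows (before NumPy 2.0), so the upper bound 2 ** 32 - 1 does not fit. One possible workaround, sketched here with the standard library only, is to draw the seed without NumPy; the function name draw_random_state is hypothetical and this is not necessarily how the project fixed it.

```python
import random

INT32_MAX = 2 ** 31 - 1


def draw_random_state() -> int:
    """Draw a seed from the same range as np.random.randint(0, 2 ** 32 - 1).

    The upper bound is exclusive, as it is for np.random.randint.
    """
    return random.randrange(0, 2 ** 32 - 1)


# The requested upper bound is larger than the biggest int32, hence the error.
print(2 ** 32 - 1 > INT32_MAX)
print(draw_random_state())
```

Staying with NumPy, passing dtype=np.int64 to np.random.randint is another option. Passing an explicit random_state to tf.split() avoids the failing branch altogether, since that line only runs when random_state is None.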
