Describe the bug
When I train the example TransE code on a custom go.owl file, some GO terms appear to be missing from the resulting embeddings.
How to reproduce
Using the mOWL wrapper model (GraphPlusPyKEENModel), the code is as follows.
```python
import mowl
mowl.init_jvm("20g")  # must run before importing other mowl modules

import torch as th
from pykeen.models import TransE
from mowl.datasets.base import PathDataset
from mowl.models import GraphPlusPyKEENModel
from mowl.projection import DL2VecProjector

dataset = PathDataset("go_cafa3.owl")

# Project the ontology with DL2Vec and train TransE on the resulting graph
model = GraphPlusPyKEENModel(dataset)
model.set_projector(DL2VecProjector())
model.set_kge_method(TransE, random_seed=42)
model.optimizer = th.optim.Adam
model.lr = 0.001
model.batch_size = 32
model.train(epochs=1)

class_embs = model.class_embeddings
role_embs = model.object_property_embeddings
ind_embs = model.individual_embeddings

# Collect the embeddings whose IRI ends in a GO identifier
terms = []
vectors = []
for word in class_embs:
    vector = class_embs[word]
    items = word.split('/')
    if len(items) > 1:
        word = items[-1]
    if word.startswith('GO') and not word.endswith('>'):
        terms.append(word)
        vectors.append(vector)

# IRI fragments use an underscore (GO_0005926), not a colon
print('GO_0005926' in terms)
```
False
However, GO_0005926 is present in the OWL file, where it is marked as deprecated: `<owl:Class rdf:about="http://purl.obolibrary.org/obo/GO_0005926"> <obo:IAO_0000231 rdf:resource="http://purl.obolibrary.org/obo/IAO_0000227"/> <obo:IAO_0100001 rdf:resource="http://purl.obolibrary.org/obo/GO_0005925"/> <owl:deprecated rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">true</owl:deprecated> </owl:Class>` ...
The same thing happens with the PyKEEN-based version of the code:
```python
import mowl
mowl.init_jvm("20g")  # must run before importing other mowl modules

from pykeen.models import TransE
from mowl.datasets.base import PathDataset
from mowl.kge import KGEModel
from mowl.projection import TaxonomyProjector
from mowl.projection.edge import Edge

dataset = PathDataset("go.owl")

# Project the ontology taxonomy into edges and convert them to PyKEEN triples
proj = TaxonomyProjector(True)
edges = proj.project(dataset.ontology)
# edges = [Edge("node1", "rel1", "node3"), Edge("node5", "rel2", "node1"), Edge("node2", "rel1", "node1")]  # example of edges
triples_factory = Edge.as_pykeen(edges, create_inverse_triples=True)

pk_model = TransE(triples_factory=triples_factory, embedding_dim=50, random_seed=42)
model = KGEModel(triples_factory, pk_model, epochs=1, batch_size=32)
model.train()

ent_embs = model.class_embeddings_dict
rel_embs = model.object_property_embeddings_dict

# Collect the embeddings whose IRI ends in a GO identifier
terms = []
vectors = []
for word in ent_embs:
    vector = ent_embs[word]
    items = word.split('/')
    if len(items) > 1:
        word = items[-1]
    if word.startswith('GO') and not word.endswith('>'):
        terms.append(word)
        vectors.append(vector)

print('GO_0005926' in terms)
```
False
I also noticed that when running the projection step (`proj = TaxonomyProjector(True)`, `edges = proj.project(dataset.ontology)` and `triples_factory = Edge.as_pykeen(edges, create_inverse_triples = True)`), the log shows "INFO: Number of ontology classes: 50119", but the final `len(terms)` is only 42819. Is this because the deprecated (obsolete) terms are discarded?
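To check whether the gap of 7300 classes (50119 - 42819) corresponds to deprecated classes, the OWL file can be inspected directly. Below is a minimal sketch using only the Python standard library. It assumes the file is serialized as RDF/XML with named classes declared as top-level `owl:Class` elements, as in the snippet above. The function name `split_classes_by_deprecation` is my own and is not part of mOWL.

```python
import xml.etree.ElementTree as ET

# Standard OWL and RDF namespaces in ElementTree's {uri}tag notation
OWL = "{http://www.w3.org/2002/07/owl#}"
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"


def split_classes_by_deprecation(source):
    """Return (active, deprecated) sets of class IRIs.

    `source` is a path or file object containing RDF/XML (assumption).
    Only top-level owl:Class elements with an rdf:about IRI are counted.
    """
    active, deprecated = set(), set()
    root = ET.parse(source).getroot()
    for cls in root.findall(OWL + "Class"):
        iri = cls.get(RDF + "about")
        if iri is None:  # anonymous class expression, skip it
            continue
        flag = cls.find(OWL + "deprecated")
        if flag is not None and (flag.text or "").strip() == "true":
            deprecated.add(iri)
        else:
            active.add(iri)
    # A class declared more than once counts as deprecated if any
    # declaration carries the flag.
    active -= deprecated
    return active, deprecated


# Usage (file name taken from the code above):
# active, deprecated = split_classes_by_deprecation("go.owl")
# print(len(active), len(deprecated))
```

If the number of deprecated classes is close to the size of the gap, that would suggest the missing terms are the deprecated ones. My guess is that they usually carry no subclass axioms, so the projection produces no edges for them, but I have not confirmed this.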
Environment
OS information
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
Python version
Python=3.8.13
mOWL version
mowl-borg==0.2.0
JDK version
openjdk 17.0.3-internal 2022-04-19
Additional information
If I need embeddings for deprecated (obsolete) terms, how should I proceed?
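One workaround I am considering is to fall back to the replacement term. In the OWL snippet above, GO_0005926 carries an `obo:IAO_0100001` ("term replaced by") annotation pointing to GO_0005925. The sketch below builds a deprecated-to-replacement map from the file and looks up the replacement's vector. It assumes RDF/XML serialization, that the annotation uses `rdf:resource` as in my file, and that the embedding dictionary is keyed by full IRIs. The helper names `replaced_by_map` and `embedding_with_fallback` are hypothetical and not part of mOWL.

```python
import xml.etree.ElementTree as ET

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
OWL = "{http://www.w3.org/2002/07/owl#}"
OBO = "{http://purl.obolibrary.org/obo/}"


def replaced_by_map(source):
    """Map each class IRI to the IRI in its IAO_0100001 annotation, if any."""
    mapping = {}
    root = ET.parse(source).getroot()
    for cls in root.findall(OWL + "Class"):
        iri = cls.get(RDF + "about")
        repl = cls.find(OBO + "IAO_0100001")
        if iri is None or repl is None:
            continue
        target = repl.get(RDF + "resource")
        if target:  # skip annotations given as literals instead of IRIs
            mapping[iri] = target
    return mapping


def embedding_with_fallback(iri, embeddings, mapping):
    """Return the embedding for `iri`, following replacements if needed.

    Returns None when neither the term nor any replacement has a vector.
    """
    seen = set()
    while iri not in embeddings and iri in mapping and iri not in seen:
        seen.add(iri)  # guard against cyclic replacement chains
        iri = mapping[iri]
    return embeddings.get(iri)


# Usage (class_embs as in the first code block, keys assumed to be IRIs):
# mapping = replaced_by_map("go.owl")
# vec = embedding_with_fallback(
#     "http://purl.obolibrary.org/obo/GO_0005926", class_embs, mapping)
```

This only gives the replacement term's vector as a proxy, not an embedding learned for the deprecated term itself, and terms without a replacement annotation would still be missing.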