Generate embeddings on the graph with RDF2Vec (using pyRDF2Vec), or with TransE or DistMult (using pyKEEN).
In the config folder, edit the following files as appropriate:
object_properties.txt: the list of object properties to take into account for the random walks (1 per line);
prefixes.txt: the prefixes to be added to the SPARQL queries;
get_entities.rq: the query for getting the URIs of all entities of interest.
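As a rough sketch of how these files could fit together, the snippet below prepends prefix declarations to a query body. It assumes that prefixes.txt holds one SPARQL PREFIX declaration per line, which is not stated above. The helper name build_query, the sample prefix and the sample query are all hypothetical and are not taken from this repository.

```python
def build_query(prefixes_text: str, query_text: str) -> str:
    """Prepend the non-empty prefix lines to a SPARQL query body."""
    prefix_lines = [line.strip() for line in prefixes_text.splitlines() if line.strip()]
    return "\n".join(prefix_lines + [query_text.strip()])


# Hypothetical file contents, for illustration only (assumed format).
prefixes = "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n\n"
query = "SELECT DISTINCT ?entity WHERE { ?entity a skos:Concept }\n"

full_query = build_query(prefixes, query)
print(full_query)
```

Keeping the prefixes in a separate file means that each query file only needs to contain the query body.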
Run the following commands to install the dependencies and download the data to your machine:
pip install -r requirements.txt
python preprocessing.py
Finally, generate the embeddings using:
python main.py [entities list] [-a algorithm_name]
where [entities list] is a text file containing the entity URIs (1 per line).
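To illustrate the expected input format, here is a minimal sketch that reads such a file into a Python list. The function name read_entities and the temporary file are hypothetical. Skipping blank lines and dropping duplicates are choices made for this sketch, not documented behaviour of main.py.

```python
import tempfile
from pathlib import Path


def read_entities(path):
    """Return the non-empty, de-duplicated URIs of a 1-per-line text file, in file order."""
    seen = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        uri = line.strip()
        if uri:
            seen.setdefault(uri, None)  # dicts keep insertion order
    return list(seen)


# Write a small sample file (illustrative content) and read it back.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "entities.txt"
    sample.write_text(
        "http://data.odeuropa.eu/vocabulary/olfactory-objects/269\n"
        "\n"
        "http://data.odeuropa.eu/vocabulary/olfactory-objects/267\n"
        "http://data.odeuropa.eu/vocabulary/olfactory-objects/269\n",
        encoding="utf-8",
    )
    entities = read_entities(sample)

print(entities)
```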
Using the files generated by the preprocessing step, you can run, for example:
python main.py voc
python main.py smells -a TransE
This produces an [entity].kv file, which is a gensim KeyedVectors file.
To generate a dense subgraph, run the following scripts:
python reduce_graph.py
python generate_dense_subgraph.py
and then run the embedding script again with
python main.py [entities list] [-a algorithm_name] -d dense_graph.csv
Load the embeddings as follows:
from gensim.models import KeyedVectors
emb = KeyedVectors.load('emb.kv')
Search for the entities most similar to a term:
emb.most_similar('http://data.odeuropa.eu/vocabulary/olfactory-objects/269', topn=10) # incense
# 0.7755 http://data.odeuropa.eu/vocabulary/olfactory-objects/267 Frankincense
Refer to gensim's documentation for further possibilities.
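The score returned by most_similar is a cosine similarity between embedding vectors. The pure-Python sketch below reproduces that ranking on a toy dictionary so the numbers are easier to interpret. The two-dimensional vectors and short labels are made up for illustration and are not real embeddings. The helper names cosine and most_similar are defined here and are not gensim's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, key, topn=10):
    """Rank all other keys by cosine similarity to `key`, highest first."""
    query = vectors[key]
    scored = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != key]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy vectors, invented for this example only.
toy = {
    "incense": [1.0, 0.0],
    "frankincense": [0.9, 0.1],
    "tar": [0.0, 1.0],
}

ranking = most_similar(toy, "incense", topn=2)
print(ranking)
```

A score close to 1 means the two vectors point in nearly the same direction, while a score near 0 means they are close to orthogonal.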
We performed clustering and link prediction experiments using the following code: