
pyrdf2vec's Introduction


Python implementation and extension of RDF2Vec to create a 2D feature matrix from a Knowledge Graph for downstream ML tasks.



What is RDF2Vec?

RDF2Vec is an unsupervised technique that builds on Word2Vec, which learns an embedding per word in one of two ways:

  1. by predicting the word based on its context: Continuous Bag-of-Words (CBOW);
  2. by predicting the context based on a word: Skip-Gram (SG).

To create this embedding, RDF2Vec first creates "sentences" which can be fed to Word2Vec by extracting walks of a certain depth from a Knowledge Graph.
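
For illustration, here is a minimal, standalone sketch (not the pyRDF2Vec API; the walks and prefixes are hypothetical) showing how such walks, flattened into token sequences, could be fed directly to gensim's Word2Vec:

from gensim.models import Word2Vec  # gensim >= 4.0

# Two hypothetical walks extracted from a KG; every URI acts as a "word".
walks = [
    ["dbr:Belgium", "dbo:capital", "dbr:Brussels"],
    ["dbr:Belgium", "dbo:currency", "dbr:Euro"],
]
model = Word2Vec(sentences=walks, vector_size=50, min_count=1, sg=1)  # sg=1: Skip-Gram
embedding = model.wv["dbr:Belgium"]  # the learned vector for the entity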

This repository contains an implementation of the algorithm in "RDF2Vec: RDF Graph Embeddings and Their Applications" by Petar Ristoski, Jessica Rosati, Tommaso Di Noia, Renato De Leone, Heiko Paulheim ([paper] [original code]).

Recently, a book about RDF2Vec was published by Heiko Paulheim, Jan Portisch, and Petar Ristoski. The book is a great introduction to what RDF2Vec is, and what can be done with it. The examples in the book use pyRDF2Vec, so it is recommended to have a look at it!

Getting Started

For most use cases, here is how pyRDF2Vec can be used to generate embeddings and extract literals from a given Knowledge Graph (KG) and a set of entities:

import pandas as pd

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
from pyrdf2vec.walkers import RandomWalker

# Read a TSV file containing the entities we want to classify.
data = pd.read_csv("samples/countries-cities/entities.tsv", sep="\t")
entities = [entity for entity in data["location"]]
print(entities)
# [
#    "http://dbpedia.org/resource/Belgium",
#    "http://dbpedia.org/resource/France",
#    "http://dbpedia.org/resource/Germany",
# ]

# Define our knowledge graph (here: DBPedia SPARQL endpoint).
knowledge_graph = KG(
    "https://dbpedia.org/sparql",
    skip_predicates={"www.w3.org/1999/02/22-rdf-syntax-ns#type"},
    literals=[
        [
            "http://dbpedia.org/ontology/wikiPageWikiLink",
            "http://www.w3.org/2004/02/skos/core#prefLabel",
        ],
        ["http://dbpedia.org/ontology/humanDevelopmentIndex"],
    ],
)
# Create our transformer, setting the embedding & walking strategy.
transformer = RDF2VecTransformer(
    Word2Vec(epochs=10),
    walkers=[RandomWalker(4, 10, with_reverse=False, n_jobs=2)],
    # verbose=1
)
# Get our embeddings.
embeddings, literals = transformer.fit_transform(knowledge_graph, entities)
print(embeddings)
# [
#     array([ 1.5737595e-04,  1.1333118e-03, -2.9838676e-04,  ..., -5.3064007e-04,
#             4.3192197e-04,  1.4529384e-03], dtype=float32),
#     array([-5.9027621e-04,  6.1689125e-04, -1.1987977e-03,  ...,  1.1066757e-03,
#            -1.0603866e-05,  6.6087965e-04], dtype=float32),
#     array([ 7.9996325e-04,  7.2907173e-04, -1.9482171e-04,  ...,  5.6251377e-04,
#             4.1435464e-04,  1.4478950e-04], dtype=float32)
# ]

print(literals)
# [
#     [('1830 establishments in Belgium', 'States and territories established in 1830',
#       'Western European countries', ..., 'Member states of the Organisation
#       internationale de la Francophonie', 'Member states of the Union for the
#       Mediterranean', 'Member states of the United Nations'), 0.919],
#     [('Group of Eight nations', 'Southwestern European countries', '1792
#       establishments in Europe', ..., 'Member states of the Union for the
#       Mediterranean', 'Member states of the United Nations', 'Transcontinental
#       countries'), 0.891]
#     [('Germany', 'Group of Eight nations', 'Articles containing video clips', ...,
#       'Member states of the European Union', 'Member states of the Union for the
#       Mediterranean', 'Member states of the United Nations'), 0.939]
#  ]

If you are using a dataset other than MUTAG (where the entities of interest have no parents in the KG), it is highly recommended to set with_reverse=True (default: False) in the walking strategy (e.g., RandomWalker). This parameter gives Word2Vec a better learning window for an entity, based on both its parents and its children, and thus leads to better accuracy on test data, as in the sketch below.
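
A minimal sketch, reusing the parameters from the snippet above with only with_reverse flipped:

from pyrdf2vec.walkers import RandomWalker

# Also traverse incoming edges, so walks can pass through an entity's parents.
RandomWalker(4, 10, with_reverse=True, n_jobs=2)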

For a more concrete walkthrough, we provide a blog post with a tutorial on how to use pyRDF2Vec here.

NOTE: this blog post uses an older version of pyRDF2Vec; some commands need to be adapted.

If you run the above snippet, you will not necessarily get the same embeddings, since randomness is not controlled by default; however, reproducible results remain possible (SEE: FAQ).

Installation

pyRDF2Vec can be installed in three ways:

  1. from PyPI using pip:
pip install pyRDF2Vec
  2. from any compatible Python dependency manager (e.g., poetry):
poetry add pyRDF2Vec
  3. from source:
git clone https://github.com/IBCNServices/pyRDF2Vec.git
pip install .

Introduction

To create embeddings for a list of entities, there are two steps to do beforehand:

  1. use a KG;
  2. define a walking strategy.

For more elaborate examples, check the examples folder.

If no sampling strategy is defined, UniformSampler is used. Similarly, Word2Vec is used as the default embedding technique; a sketch of the equivalent explicit configuration is shown below.
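
A minimal sketch, assuming the defaults described in this README and that UniformSampler lives in pyrdf2vec.samplers alongside the other samplers:

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.samplers import UniformSampler
from pyrdf2vec.walkers import RandomWalker

# Spelling out the defaults: Word2Vec as the embedder and a RandomWalker
# with a UniformSampler as the walking/sampling strategy.
RDF2VecTransformer(
    Word2Vec(),
    walkers=[RandomWalker(2, None, UniformSampler())],
)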

Use a Knowledge Graph

To use a KG, you can initialize it in three ways:

  1. From an endpoint server using SPARQL:
from pyrdf2vec.graphs import KG

# Define the DBpedia endpoint server, as well as a set of predicates to
# exclude from this KG and a list of predicate chains to fetch the literals.
KG(
    "https://dbpedia.org/sparql",
    skip_predicates={"www.w3.org/1999/02/22-rdf-syntax-ns#type"},
    literals=[
        [
            "http://dbpedia.org/ontology/wikiPageWikiLink",
            "http://www.w3.org/2004/02/skos/core#prefLabel",
        ],
        ["http://dbpedia.org/ontology/humanDevelopmentIndex"],
     ],
)
  2. From a file using RDFLib:
from pyrdf2vec.graphs import KG

# Define the MUTAG KG, as well as a set of predicates to exclude from
# this KG and a list of predicate chains to get the literals.
KG(
    "samples/mutag/mutag.owl",
    skip_predicates={"http://dl-learner.org/carcinogenesis#isMutagenic"},
    literals=[
        [
            "http://dl-learner.org/carcinogenesis#hasBond",
            "http://dl-learner.org/carcinogenesis#inBond",
        ],
        [
            "http://dl-learner.org/carcinogenesis#hasAtom",
            "http://dl-learner.org/carcinogenesis#charge",
        ],
    ],
)
  3. From scratch:
from pyrdf2vec.graphs import KG, Vertex

GRAPH = [
    ["Alice", "knows", "Bob"],
    ["Alice", "knows", "Dean"],
    ["Dean", "loves", "Alice"],
]
URL = "http://pyRDF2Vec"
CUSTOM_KG = KG()

for row in GRAPH:
    subj = Vertex(f"{URL}#{row[0]}")
    obj = Vertex(f"{URL}#{row[2]}")
    pred = Vertex(f"{URL}#{row[1]}", predicate=True, vprev=subj, vnext=obj)
    CUSTOM_KG.add_walk(subj, pred, obj)
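
The resulting KG can then be used like any other one; a small sketch, reusing URL and CUSTOM_KG from above and assuming entities can be passed as plain URI strings:

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.walkers import RandomWalker

entities = [f"{URL}#{name}" for name in ["Alice", "Bob", "Dean"]]
embeddings, literals = RDF2VecTransformer(
    walkers=[RandomWalker(2, None)]
).fit_transform(CUSTOM_KG, entities)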

Define Walking Strategies With Their Sampling Strategy

All supported walking strategies can be found on the Wiki page.

As the number of walks grows exponentially with the depth, exhaustively extracting all walks quickly becomes infeasible for larger Knowledge Graphs. To avoid this issue, sampling strategies can be applied: they extract a fixed maximum number of walks per entity, sampling the walks according to a certain metric.

For example, if one wants to extract a maximum of 10 walks of a maximum depth of 4 for each entity using the random walking strategy and the PageRank sampling strategy, the following code snippet can be used:

from pyrdf2vec.samplers import PageRankSampler
from pyrdf2vec.walkers import RandomWalker

walkers = [RandomWalker(4, 10, PageRankSampler())]

Speed up the Extraction of Walks

The extraction of walks can take hours, if not days, in some cases. That's why it is important to use certain attributes and to optimize pyRDF2Vec's parameters as much as possible for your use case.

This section offers some advice on setting these parameters.

Configure the n_jobs attribute to use multiple processors

By default, multiprocessing is disabled (n_jobs=1). If your machine allows it, it is recommended to use multiprocessing by increasing the number of processors used for the extraction of walks:

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.walkers import RandomWalker

RDF2VecTransformer(walkers=[RandomWalker(4, 10, n_jobs=4)])

In the above snippet, the random walking strategy will use 4 processors to extract the walks, whether for a local or remote KG.

WARNING: using a large number of processors may violate the policy of some SPARQL endpoint servers. With multiprocessing, each processor sends SPARQL requests to the server to fetch the hops of the entity it is processing. Since these requests may arrive within a short time span, the server could mistake them for a Denial-of-Service (DoS) attack. These risks are, of course, higher in the absence of a cache and when the number of entities to process is large.

Bundle SPARQL requests

By default, the bundling of SPARQL requests is disabled (mul_req=False). However, if you are using a remote KG and have a large number of entities, this option can greatly speed up the extraction of walks:

import pandas as pd

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.graphs import KG
from pyrdf2vec.walkers import RandomWalker

data = pd.read_csv("samples/countries-cities/entities.tsv", sep="\t")

RDF2VecTransformer(walkers=[RandomWalker(4, 10)]).fit_transform(
    KG("https://dbpedia.org/sparql", mul_req=True),
    [entity for entity in data["location"]],
)

In the above snippet, the KG instructs its internal connector to fetch the hops of the specified entities asynchronously. These hops are then stored in the cache and accessed by the walking strategy, which accelerates the extraction of walks for these entities.

WARNING: bundling SPARQL requests for a number of entities that is too large may violate the policy of some SPARQL endpoint servers. As with multiprocessing (which can be combined with mul_req), sending a large number of SPARQL requests simultaneously could be seen by a server as a DoS attack. Be aware that the number of entities in your file corresponds to the number of simultaneous requests that will be made and cached.

Modify the Cache Settings

By default, pyRDF2Vec uses a cache with a Least Recently Used (LRU) policy, a size of 1024 entries, and a Time To Live (TTL) of 1200 seconds. For some use cases, you may want to change the cache policy, increase (or decrease) the cache size, and/or change the TTL:

import pandas as pd
from cachetools import MRUCache

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.graphs import KG
from pyrdf2vec.walkers import RandomWalker

data = pd.read_csv("samples/countries-cities/entities.tsv", sep="\t")

RDF2VecTransformer(walkers=[RandomWalker(4, 10)]).fit_transform(
    KG("https://dbpedia.org/sparql", cache=MRUCache(maxsize=2048)),
    [entity for entity in data["location"]],
)
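
To change the TTL as well, cachetools' TTLCache (an LRU policy combined with per-entry expiry) could be used instead; a sketch, doubling the default size and halving the default TTL:

from cachetools import TTLCache

from pyrdf2vec.graphs import KG

KG("https://dbpedia.org/sparql", cache=TTLCache(maxsize=2048, ttl=600))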

Modify the Walking Strategy Settings

By default, pyRDF2Vec uses [RandomWalker(2, None, UniformSampler())] as its walking strategy. Using a greater maximum depth means a longer extraction time for walks. Note also that max_walks=None extracts more walks and is, in most cases, faster than specifying a number (SEE: FAQ).

In some cases, using another sampling strategy can speed up the extraction of walks by assigning a higher weight to some paths than others:

import pandas as pd

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.graphs import KG
from pyrdf2vec.samplers import PageRankSampler
from pyrdf2vec.walkers import RandomWalker

data = pd.read_csv("samples/countries-cities/entities.tsv", sep="\t")

RDF2VecTransformer(
    walkers=[RandomWalker(2, None, PageRankSampler())]
).fit_transform(
    KG("https://dbpedia.org/sparql"),
    [entity for entity in data["location"]],
)

Set Up a Local Server

Loading large RDF files into memory causes memory issues. Remote KGs are a solution for larger KGs, but using a public endpoint will be slower due to the overhead caused by HTTP requests. For that reason, it is better to set up your own local server and use it as your "Remote" KG.

To set up such a server, a tutorial is available on our wiki.

Documentation

For more information on how to use pyRDF2Vec, visit our online documentation, which is automatically updated with the latest version of the main branch.

There, you can learn more about the available modules and their functions.

Contributions

Your help in the development of pyRDF2Vec is more than welcome.

[architecture diagram]

The architecture of pyRDF2Vec makes it easy to create new extraction and sampling strategies, as well as new embedding techniques. To better understand how you can help, whether through pull requests and/or issues, please take a look at the CONTRIBUTING file.

FAQ

How to Ensure the Generation of Similar Embeddings?

pyRDF2Vec's walking strategies, sampling strategies, and Word2Vec all involve randomness. To get reproducible embeddings, you first need to set a seed to ensure determinism:

PYTHONHASHSEED=42 python foo.py

In addition, you must specify a random state for the walking strategy, which will implicitly be used by the sampling strategy:

from pyrdf2vec.walkers import RandomWalker

RandomWalker(2, None, random_state=42)

NOTE: the PYTHONHASHSEED (e.g., 42) is to ensure determinism.

Finally, to ensure random determinism for Word2Vec, you must specify a single worker:

from pyrdf2vec.embedders import Word2Vec

Word2Vec(workers=1)

NOTE: using the n_jobs and mul_req parameters does not affect the random determinism.
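
Putting the three steps together, a minimal reproducibility sketch:

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.walkers import RandomWalker

# Run with: PYTHONHASHSEED=42 python foo.py
# (the hash seed must be set before the interpreter starts).
RDF2VecTransformer(
    Word2Vec(workers=1),
    walkers=[RandomWalker(2, None, random_state=42)],
)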

Why Is the Extraction of Walks Faster With max_walks=None?

Currently, the BFS function (using the breadth-first search algorithm) is used when max_walks=None; it is significantly faster than the DFS function (using the depth-first search algorithm) and extracts more walks.

We hope that this algorithmic complexity issue will be solved in the next release of pyRDF2Vec.

How to Silence the tcmalloc Warning When Using FastText With Medium/Large KGs?

Set the TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD environment variable to a high value.
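
For instance, mirroring the PYTHONHASHSEED usage above (the threshold here, 1 TiB expressed in bytes, is an arbitrarily high value, not an official recommendation):

TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD=1099511627776 python foo.py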

Referencing

If you use pyRDF2Vec in a scholarly article, we would appreciate a citation:

@inproceedings{pyrdf2vec,
  title        = {pyRDF2Vec: A Python Implementation and Extension of RDF2Vec},
  author       = {Steenwinckel, Bram and Vandewiele, Gilles and Agozzino, Terencio and Ongenae, Femke},
  year         = 2023,
  publisher    = {Springer Nature Switzerland},
  booktitle    = {European Semantic Web Conference},
  doi          = {10.1007/978-3-031-33455-9_28},
  url          = {https://arxiv.org/abs/2205.02283},
  pages        = {471--483},
}

pyrdf2vec's People

Contributors

benedekrozemberczki, bsteenwi, dependabot[bot], gillesvandewiele, mweyns, rememberyou


pyrdf2vec's Issues

Value error when running the example

When I run the example script (python example.py) I get the following error message:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/matplotlib/axes/_axes.py", line 4238, in scatter
    raise ValueError

I also include the full output for completeness' sake:

$ python3 example.py 
100%|████████████████████████████████████████████████████████████████████████| 74567/74567 [00:01<00:00, 49093.55it/s]
Extracted 52871 walks for 340 instances!
Support Vector Machine: Accuracy = 0.7647058823529411
[[44  1]
 [15  8]]
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/matplotlib/axes/_axes.py", line 4238, in scatter
    raise ValueError
ValueError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "example.py", line 95, in <module>
    facecolors=[color_map[i] for i in all_labels],
  File "/usr/lib/python3/dist-packages/matplotlib/pyplot.py", line 2864, in scatter
    is not None else {}), **kwargs)
  File "/usr/lib/python3/dist-packages/matplotlib/__init__.py", line 1812, in inner
    return func(ax, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/matplotlib/axes/_axes.py", line 4245, in scatter
    .format(nc=n_elem, xs=x.size, ys=y.size)
ValueError: 'c' argument has 340 elements, which is not acceptable for use with 'x' with size 272, 'y' with size 272.

Weird b'... elements created during walks

❓ Question

I'm currently working with RDF2Vec and I noticed that these weird b'…' descriptors (e.g., b'\x96]x\xe9\xaf\xbaTu') are used in the sentences during the walks. Are these some kind of placeholders?

If yes, is there a way to display the real name of the element/entity?

Store and save rdf2vec transformers

🚀 Feature

It should be possible to easily serialize our RDF2VecTransformer. load() and save() methods should be created. This can probably be done with pickle.
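
A minimal sketch of what this could look like with pickle (hypothetical; no such save()/load() API exists yet):

import pickle

from pyrdf2vec import RDF2VecTransformer

transformer = RDF2VecTransformer()  # ...fit_transform(...) in practice
with open("transformer.pkl", "wb") as f:
    pickle.dump(transformer, f)  # the proposed save()
with open("transformer.pkl", "rb") as f:
    transformer = pickle.load(f)  # the proposed load()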

embeddings of resource label

hi,
I would like to know whether we can generate embeddings for the URI resource label using pyrdf2vec, in order to compare them with the embedding array of a word from word2vec, or does rdf2vec accept only URI resources and not rdfs:label?

pyrdf2vec version

hi,

I'm working with gensim version 3.2.8. I want to use pyrdf2vec; which pyrdf2vec version can I use with gensim 3.8.2?
Thank you

Errors during install

When I try to install using the python -m prefix (full command: python3 -m pip install pyRDF2Vec &> log.out), the installation emits the following errors:

        analyzeline(m, pat[1], line)
      File "/usr/lib/python3/dist-packages/numpy/f2py/crackfortran.py", line 1098, in analyzeline
        last_name = updatevars(typespec, selector, attr, edecl)
      File "/usr/lib/python3/dist-packages/numpy/f2py/crackfortran.py", line 1542, in updatevars
        attrspec = [x.strip() for x in markoutercomma(attrspec).split('@,@')]
      File "/usr/lib/python3/dist-packages/numpy/f2py/crackfortran.py", line 833, in markoutercomma
        assert not f, repr((f, line, l))
    AssertionError: (-1, 'intent(in))', 'intent(in))')
    ----------------------------------------
ERROR: Command errored out with exit status 1: /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-wz2xi_lr/scipy/setup.py'"'"'; __file__='"'"'/tmp/pip-install-wz2xi_lr/scipy/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-krxuqpuw/install-record.txt --single-version-externally-managed --user --prefix= --compile --install-headers /home/wouter/.local/include/python3.7m/scipy Check the logs for full command output.

I have included the full logs for completeness: out.log

TypeError: __init__() got an unexpected keyword argument 'label_predicates'

hi,
I create a KG object from the DBpedia SPARQL endpoint:

label_predicates = [
'http://www.w3.org/2000/01/rdf-schema#comment',
'http://www.w3.org/2000/01/rdf-schema#label',
'http://www.w3.org/2000/01/rdf-schema#seeAlso',
'http://www.w3.org/2002/07/owl#sameAs',
'http://www.w3.org/2003/01/geo/wgs84_pos#geometry',
'http://dbpedia.org/ontology/wikiPageRedirects',
'http://www.w3.org/2003/01/geo/wgs84_pos#lat',
'http://www.w3.org/2003/01/geo/wgs84_pos#long',
'http://www.w3.org/2004/02/skos/core#exactMatch',
'http://www.w3.org/ns/prov#wasDerivedFrom',
'http://xmlns.com/foaf/0.1/depiction',
'http://xmlns.com/foaf/0.1/homepage',
'http://xmlns.com/foaf/0.1/isPrimaryTopicOf',
'http://xmlns.com/foaf/0.1/name',
'http://dbpedia.org/property/website',
'http://dbpedia.org/property/west',
'http://dbpedia.org/property/wordnet_type',
'http://www.w3.org/2002/07/owl#differentFrom',
]

kg = KG("https://dbpedia.org/sparql", is_remote=True,
label_predicates=[rdflib.URIRef(x) for x in label_predicates])

I get this error:

kg = KG("https://dbpedia.org/sparql", is_remote=True,
TypeError: __init__() got an unexpected keyword argument 'label_predicates'

access to remote sparql endpoint

Could you explain how to use it when we want to access some SPARQL endpoints such as Yago ("https://yago-knowledge.org/sparql")?
It raises this error:
~/.local/lib/python3.8/site-packages/pyrdf2vec/graphs/kg.py in _get_shops(self, vertex)
113 results = self.endpoint.query().convert()
114 neighbors = []
--> 115 for result in results["results"]["bindings"]:
116 predicate, obj = result["p"]["value"], result["o"]["value"]
117 if predicate not in self.label_predicates:

TypeError: byte indices must be integers or slices, not str

Furthermore, access to endpoints is very slow, and using the dataset files requires a lot of memory, so how can this problem be handled? Is there any way to access a graph database? If so, how?

The HALK walking strategy is not working as it should

🐛 Bug

The use of multiprocessing results in the HALK walking strategy not being as effective as it should be.

Expected Behavior

A vertex should be considered rare when the ratio of its number of occurrences to the number of extracted walks for all provided entities is smaller than a threshold frequency.

Current Behavior

Currently, the HALK walking strategy considers a vertex to be rare when the ratio of its number of occurrences to the number of extracted walks for a given entity is smaller than a threshold frequency.

Possible Solution

Apply the HALK walking strategy in the Walker class after extracting the walks with RandomWalker.

Install errors using Windows 10 / Python 2.7

Trying to install using Python 3.7/3.8 gave me similar but more complex errors, so I downgraded to try with Python 2.7. I get this single error for scikit_learn (see below).

pip install pyrdf2vec
DEPRECATION: Python 2.7 reached the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 is no longer maintained. A future version of pip will drop support for Python 2.7. More details about Python 2 support in pip, can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support`
Collecting pyrdf2vec
Using cached pyRDF2Vec-0.0.3.tar.gz (425 kB)
Collecting gensim==3.5.0
Using cached gensim-3.5.0-cp27-cp27m-win_amd64.whl (23.5 MB)
Collecting matplotlib==2.1.1
Using cached matplotlib-2.1.1-cp27-cp27m-win_amd64.whl (8.4 MB)
Processing c:\users\roos\appdata\local\pip\cache\wheels\68\f8\29\b53346a112a07d30a5a84d53f19aeadaa1a474897c0423af91\networkx-2.2-py2.py3-none-any.whl
Collecting numpy==1.13.3
Using cached numpy-1.13.3-cp27-none-win_amd64.whl (13.0 MB)
Collecting pandas==0.23.4
Using cached pandas-0.23.4-cp27-cp27m-win_amd64.whl (7.3 MB)
Processing c:\users\roos\appdata\local\pip\cache\wheels\8d\f6\b7\f5e9501d0f006fc9fd497c930206952856b2191ab5c836cb97\rdflib-4.2.2-cp27-none-any.whl
ERROR: Could not find a version that satisfies the requirement scikit_learn==0.21.2 (from pyrdf2vec) (from versions: 0.9, 0.10, 0.11, 0.12, 0.12.1, 0.13, 0.13.1, 0.14, 0.14.1, 0.15.0b1, 0.15.0b2, 0.15.0, 0.15.1, 0.15.2, 0.16b1, 0.16.0, 0.16.1, 0.17b1, 0.17, 0.17.1, 0.18rc2, 0.18, 0.18.1, 0.18.2, 0.19b2, 0.19.0, 0.19.1, 0.19.2, 0.20rc1, 0.20.0, 0.20.1, 0.20.2, 0.20.3, 0.20.4, 0.21rc2)
ERROR: No matching distribution found for scikit_learn==0.21.2 (from pyrdf2vec)`

Generate embeddings of new points

Hello,
I have two questions:
1. When we use .fit to generate the embeddings, how can we save the model and use it to create embeddings for new entities of the KG?
2. Before, when I used this library, I did not get any error for label_predicates, while now it returns an unexpected keyword argument!
kg = KG("dataset.xml", label_predicates=[rdflib.URIRef(x) for x in label_predicates])

random values in the rdf2vec embeddings array on each execution

❓ Question

Hi,
I want to create embeddings using pyrdf2vec, but the problem of random values in the resulting array always persists: I use the same code, and the values in the array are different on each execution, even with PYTHONHASHSEED=42. I can't find out why this happens. I use (Word2Vec(size=200), [RandomWalker(1, 200)]) and I have only one entity.

I would also like to ask whether the embeddings produced should end with e-03 and e-04 exponents, like what I get:

[array([-1.71203376e-03, 9.66149906e-04, 2.07987218e-03, 2.63648690e-04,
-1.58836076e-03, -2.15106155e-03,

prediction algorithms in rdf2vec

hi,
I would like to ask which algorithm is used in rdf2vec, CBOW or Skip-Gram, and how we can change from one algorithm to the other?

literals when using categories embeddings

I try to get embeddings of categories instead of entities; I get the literals like this:

categories = ['http://dbpedia.org/resource/Category:Climatology', 'http://dbpedia.org/resource/Category:Radiometry', 'http://dbpedia.org/resource/Category:Anarchism']

[[nan, nan], [nan, nan], [nan, nan]]

Advice about walkers using ingoing and outgoing links

Hello

I'm working to adapt our previous work to pyRDF2Vec, so we need to build walks that follow both the incoming and outgoing links of a vertex.
I see two possibilities:

  • enhance the graph by building the reverse links (only possible for a local graph);
  • build a method similar to get_hops which adds the ingoing links to the outgoing links obtained by get_hops (for example, by building hops with virtual reverse predicates).

Could you give me some advice on the choice between these solutions, and on the implementation of the second one (which is my favorite)? I imagine that I need to use the _transition_matrix of the KG.

Hi

❓ Question

nest_asyncio is not added in pyproject.toml

🐛 Bug

Hi all,

my installation from source throws the following error:

Current Behavior

Traceback (most recent call last):
  File "playground.py", line 3, in <module>
    from pyrdf2vec import RDF2VecTransformer
  File "/home/mblum/anaconda3/envs/rdf2vec/lib/python3.8/site-packages/pyrdf2vec/__init__.py", line 1, in <module>
    import nest_asyncio
ModuleNotFoundError: No module named 'nest_asyncio'

Steps to Reproduce

  1. conda create --name rdf2vec python=3.8
  2. conda activate rdf2vec
  3. git clone https://github.com/IBCNServices/pyRDF2Vec.git
  4. cd pyRDF2Vec/
  5. pip install .
  6. pip install pandas
  7. python playground.py (getting started example)

pyrdf2vec version

hi ,

I need to use gensim 3.8.2 in my code; I would like to ask which pyrdf2vec version is compatible with gensim 3.8.2?
thank you

Access Remote Graphs like DBpedia, MusicBrainz

As the project stands now, we can access only local graphs, but developers and researchers want to access other graphs such as DBpedia. We could implement a feature which takes an endpoint and a set of entities, so that the library creates embeddings for these entities.

dataset

Are the AIFB, MUTAG, BGS, and AM datasets available? I need the RDF format.


Using Wikidata Sparql Endpoint

❓ Question

Hi,
We want to create embeddings for items from DBpedia and Wikidata using SPARQL endpoints.

DBpedia works as expected:
kg = KG("https://dbpedia.org/sparql", is_remote=True)
We get fine results.

Wikidata
kg = KG("https://query.wikidata.org/sparql", is_remote=True)
We edited kg.py to define and add an agent in SPARQLWrapper: self.endpoint = SPARQLWrapper(location, agent=user_agent).
We get results, but they seem to be random numbers; no clusters are recognizable in any way.
Checking against non-existing URIs also leads to similar results, so we assume that with correct URIs there is also no real processing.

Is there something special to take care of when working with the Wikidata SPARQL endpoint?

BG werner

The license does not indicate the type of licence used

Currently, pyRDF2Vec's license does not indicate a license type, which is ambiguous for users who do not know that the MIT license is in fact being used.

It is also recommended to respect the basic outline of this license (e.g., avoid numbers in front of the list of conditions) and to replace only what is necessary, so that the license is as compliant as possible.

It is for this reason that GitHub does not recognize the license of pyRDF2Vec as MIT.

Updating the license according to the standards would avoid ambiguities and would also allow the use of FOSSA to validate the license check.

According to the MIT license (from the generation of the license by GitHub), the license must be as follows:

MIT License

Copyright (c) <year> <copyright holders>

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Therefore, for the pyRDF2Vec repository, the license should be as follows (unless I am mistaken):

MIT License

Copyright (c) 2020 Ghent University and IMEC vzw

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

NOTE: the initial sentence "2020 Ghent University and IMEC vzw with offices at Technologiepark 15, 9052 Ghent, Belgium - Email: [email protected]." can be kept, but beware that the license risks future invalidation if either the email or the location changes.

As a result of this, @GillesVandewiele do you think the license has the merit to be modified to be in the standard?

Order of rdf triples embeddings

❓ Question

I have generated embeddings for RDF triple URIs (from DBpedia) using pyRDF2Vec. When I pass list(set(entities)) to transformer.fit_transform(), I am not sure about the order of the embeddings generated by the pyRDF2Vec transformer. Will this order affect the results when I concatenate these RDF embeddings with sentence context embeddings while training the model?

Code:

def rdftriplestovec(filepath, entities):
    kg = KG(filepath)
    transformer = RDF2VecTransformer(walkers=[RandomWalker(3, None)],
                                     embedder=Word2Vec(size=500))
    entities_names = [entity.name for entity in kg._entities]
    filtered_entities = [e for e in entities if e in entities_names]
    not_found = set(entities) - set(filtered_entities)
    print('entities could not be found in the KG! Removing them')
    entities = list(set(filtered_entities))
    embeddings = transformer.fit_transform(kg, entities)
    print(embeddings)
    return embeddings

Sample of RDF triples in the .ttl file (the predicate is of owl type), passed as filepath to the rdftriplestovec function:

@prefix owl: <http://www.w3.org/2002/07/owl#> .

<http://dbpedia.org/resource/AT&T> owl:Ontology <http://dbpedia.org/resource/Espionage>,
    <http://dbpedia.org/resource/Police> .

<http://dbpedia.org/resource/Actor> owl:Ontology <http://dbpedia.org/resource/Major>,
    <http://dbpedia.org/resource/Plea>,
    <http://dbpedia.org/resource/United_States> .

<http://dbpedia.org/resource/Actor_model> owl:Ontology <http://dbpedia.org/resource/Visibility> .

<http://dbpedia.org/resource/Advertising> owl:Ontology <http://dbpedia.org/resource/Indian_Americans> .

<http://dbpedia.org/resource/Afghan_National_Army> owl:Ontology <http://dbpedia.org/resource/Enemy> .

<http://dbpedia.org/resource/Ago,_Mie> owl:Ontology <http://dbpedia.org/resource/Haunt_(comics)>,
    <http://dbpedia.org/resource/Human_back>,
    <http://dbpedia.org/resource/Jesus> .


Sample URI list which I get from the DBpedia API for my dataset (passed as entities to the rdftriplestovec function):

['http://dbpedia.org/resource/United_States_House_of_Representatives',
'http://dbpedia.org/resource/Australian_Democrats',
'http://dbpedia.org/resource/Aide-de-camp',
'http://dbpedia.org/resource/United_Kingdom',
'http://dbpedia.org/resource/Even_language',
'http://dbpedia.org/resource/James_Comey',
'http://dbpedia.org/resource/Letter_(message)',
'http://dbpedia.org/resource/Jason_Chaffetz',
'http://dbpedia.org/resource/Twitter',
'http://dbpedia.org/resource/Italian_language',
'http://dbpedia.org/resource/Robb_Flynn',
'http://dbpedia.org/resource/Hillary_Clinton',
'http://dbpedia.org/resource/Breitbart_News',
'http://dbpedia.org/resource/Truth',
'http://dbpedia.org/resource/Get_(divorce_document)',
'http://dbpedia.org/resource/Inactivated_vaccine',
'http://dbpedia.org/resource/India',
'http://dbpedia.org/resource/Single_(music)',
'http://dbpedia.org/resource/November_2017_Somalia_airstrike',
'http://dbpedia.org/resource/Identified',
'http://dbpedia.org/resource/Iranian_peoples',
'http://dbpedia.org/resource/Woman',
'http://dbpedia.org/resource/Fiction',
'http://dbpedia.org/resource/Unpublished_Story',
'http://dbpedia.org/resource/Stoning']

Relation Prediction

Hello all,
firstly I would like to thank you for a clean implementation. Finally, we have a clean implementation of RDF2Vec to play with :)

I was wondering whether one can use RDF2Vec for the link prediction task on benchmark datasets like FB15k-237 or WN18RR, which can be found at https://github.com/ibalazevic/TuckER/tree/master/data.

Given a subject and an object, I would like to perform predicate prediction using RDF2Vec. Any thoughts or scripts are appreciated.

Cheers

ModuleNotFoundError

Hello,
Any thoughts ?

pip install pyRDF2Vec
Collecting pyRDF2Vec
Requirement already satisfied: scikit-learn in 
-------Installing collected packages: pyRDF2Vec
Successfully installed pyRDF2Vec-0.0.5

>>> from rdf2vec.converters import rdflib_to_kg
ModuleNotFoundError: No module named 'rdf2vec.converters'; 'rdf2vec' is not a package 

indeterministic results and a different embeddings array on each execution

hi ,
I run this code in the PyCharm IDE, but I get another embedding array, not the one in your example:

import pandas as pd

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
from pyrdf2vec.walkers import RandomWalker

data = pd.read_csv("samples/countries-cities/entities.tsv", sep="\t")
entities = [entity for entity in data["location"]]
print(entities)

[
    "http://dbpedia.org/resource/Belgium",
    "http://dbpedia.org/resource/France",
    "http://dbpedia.org/resource/Germany",
]

transformer = RDF2VecTransformer(
    Word2Vec(epochs=10),
    walkers=[RandomWalker(4, 10, with_reverse=False, n_jobs=2)],
    # verbose=1
)
embeddings, literals = transformer.fit_transform(
    KG(
        "https://dbpedia.org/sparql",
        skip_predicates={"www.w3.org/1999/02/22-rdf-syntax-ns#type"},
        literals=[
            [
                "http://dbpedia.org/ontology/wikiPageWikiLink",
                "http://www.w3.org/2004/02/skos/core#prefLabel",
            ],
            ["http://dbpedia.org/ontology/humanDevelopmentIndex"],
        ],
    ),
    entities,
)
print(embeddings)

[
    array([ 1.5737595e-04,  1.1333118e-03, -2.9838676e-04, ..., -5.3064007e-04,
            4.3192197e-04,  1.4529384e-03], dtype=float32),
    array([-5.9027621e-04,  6.1689125e-04, -1.1987977e-03, ...,  1.1066757e-03,
           -1.0603866e-05,  6.6087965e-04], dtype=float32),
    array([ 7.9996325e-04,  7.2907173e-04, -1.9482171e-04, ...,  5.6251377e-04,
            4.1435464e-04,  1.4478950e-04], dtype=float32)
]

The array that I get changes on each execution. I set the environment variable PYTHONHASHSEED and I use random_state=42; I use pyrdf2vec version 0.2.3.

Reproduce the results from the RDF2Vec paper

Hello,
I am trying to reproduce the results from the RDF2Vec paper using your implementation.
More specifically, I am using the generated embeddings for the AIFB dataset for classification with SVM, C4.5, Naive Bayes, and KNN.
However, I obtained results considerably different from the ones presented in RDF2Vec.
Have you tried to reproduce the results? Is it possible that these differences can be due to differences in implementation?
Thanks

rdf2vec updates

hi,
I haven't used pyrdf2vec since last August; I would like to ask if there have been any updates to the rdf2vec commands.
Thank you.

Deal with large files

Hi
In order to generate embeddings of entities when the input KG file is fairly large, what should we do? Particularly when the remote KG is also very slow.
Thanks in advance for your reply,
with regards,

Getting error

So, basically I want to use the pyrdf2vec library for my personal project, but first I wanted to try how it works. For that, I installed it in my PyCharm Community Edition 2021 and tried to run the sample program mentioned in the GitHub documentation.

Now coming to the errors I am facing: after installing the pyrdf2vec library, first of all it is installed under the name rdf2vec instead of pyrdf2vec, and without its complete folders; for example, the embedders, graphs, and a few other folders weren't there. For a better understanding, see the screenshot below.

[Screenshot (79)]

Now coming to the code: I copy-pasted the exact sample code into my editor, which obviously throws a "no module named pyrdf2vec" error; but as you can see, when I changed pyrdf2vec to rdf2vec it became available (see line 2 for rdf2vec and the other lines for pyrdf2vec). The code that I tried to run, with the errors explained in it, is below.

import pandas as pd
from rdf2vec import RDF2VecTransformer # this line wont throw error
from pyrdf2vec.embedders import Word2Vec # all of these line will throw errors
from pyrdf2vec.graphs import KG
from pyrdf2vec.walkers import RandomWalker


# Read a CSV file containing the entities we want to classify.
data = pd.read_csv("samples/countries-cities/entities.tsv", sep="\t")
entities = [entity for entity in data["location"]]
print(entities)
# [
#    "http://dbpedia.org/resource/Belgium",
#    "http://dbpedia.org/resource/France",
#    "http://dbpedia.org/resource/Germany",
# ]

# Define our knowledge graph (here: DBPedia SPARQL endpoint).
knowledge_graph = KG(
    "https://dbpedia.org/sparql",
    skip_predicates={"www.w3.org/1999/02/22-rdf-syntax-ns#type"},
    literals=[
        [
            "http://dbpedia.org/ontology/wikiPageWikiLink",
            "http://www.w3.org/2004/02/skos/core#prefLabel",
        ],
        ["http://dbpedia.org/ontology/humanDevelopmentIndex"],
    ],
)
# Create our transformer, setting the embedding & walking strategy.
transformer = RDF2VecTransformer(
    Word2Vec(epochs=10),
    walkers=[RandomWalker(4, 10, with_reverse=False, n_jobs=2)],
    # verbose=1
)
# Get our embeddings.
embeddings, literals = transformer.fit_transform(knowledge_graph, entities)
print(embeddings)
# [
#     array([ 1.5737595e-04,  1.1333118e-03, -2.9838676e-04,  ..., -5.3064007e-04,
#             4.3192197e-04,  1.4529384e-03], dtype=float32),
#     array([-5.9027621e-04,  6.1689125e-04, -1.1987977e-03,  ...,  1.1066757e-03,
#            -1.0603866e-05,  6.6087965e-04], dtype=float32),
#     array([ 7.9996325e-04,  7.2907173e-04, -1.9482171e-04,  ...,  5.6251377e-04,
#             4.1435464e-04,  1.4478950e-04], dtype=float32)
# ]

print(literals)
# [
#     [('1830 establishments in Belgium', 'States and territories established in 1830',
#       'Western European countries', ..., 'Member states of the Organisation
#       internationale de la Francophonie', 'Member states of the Union for the
#       Mediterranean', 'Member states of the United Nations'), 0.919],
#     [('Group of Eight nations', 'Southwestern European countries', '1792
#       establishments in Europe', ..., 'Member states of the Union for the
#       Mediterranean', 'Member states of the United Nations', 'Transcontinental
#       countries'), 0.891]
#     [('Germany', 'Group of Eight nations', 'Articles containing video clips', ...,
#       'Member states of the European Union', 'Member states of the Union for the
#       Mediterranean', 'Member states of the United Nations'), 0.939]
#  ]

Now, due to the incomplete library and folders from the PyCharm download, I tried to install it using the commands mentioned in the documentation:

pip install pyRDF2Vec

Now it should be good to go, but as the sample file was not available in the package, I manually downloaded it, placed it somewhere else on my PC, and provided the path to it in the code, which only changes one line of the code, from

this:
data = pd.read_csv("samples/countries-cities/entities.tsv", sep="\t")

To this:
data = pd.read_csv(r"C:\Users\Warisahmed\OneDrive\Desktop\gdb\samples\countries-cities\entities.tsv", sep="\t")

I then ran my program, which throws an error stating:

C:\Users\Warisahmed\AppData\Local\Programs\Python\Python310\python.exe C:/Users/Warisahmed/PycharmProjects/KnowledgeGraph/n2v.py
['http://dbpedia.org/resource/Belgium', 'http://dbpedia.org/resource/France', 'http://dbpedia.org/resource/Germany', 'http://dbpedia.org/resource/Australia', 'http://dbpedia.org/resource/New_Zealand', 'http://dbpedia.org/resource/Peru', 'http://dbpedia.org/resource/Sri_Lanka', 'http://dbpedia.org/resource/Cyprus', 'http://dbpedia.org/resource/Spain', 'http://dbpedia.org/resource/Portugal', 'http://dbpedia.org/resource/Russia', 'http://dbpedia.org/resource/Brussels', 'http://dbpedia.org/resource/Paris', 'http://dbpedia.org/resource/Berlin', 'http://dbpedia.org/resource/Canberra', 'http://dbpedia.org/resource/Wellington', 'http://dbpedia.org/resource/Lima', 'http://dbpedia.org/resource/Nicosia', 'http://dbpedia.org/resource/Colombo', 'http://dbpedia.org/resource/Madrid', 'http://dbpedia.org/resource/Lisbon', 'http://dbpedia.org/resource/Moscow']
Traceback (most recent call last):
  File "C:\Users\Warisahmed\PycharmProjects\KnowledgeGraph\n2v.py", line 37, in <module>
    embeddings, literals = transformer.fit_transform(knowledge_graph, entities)
  File "C:\Users\Warisahmed\AppData\Roaming\Python\Python310\site-packages\pyrdf2vec\rdf2vec.py", line 143, in fit_transform
    self.fit(self.get_walks(kg, entities), is_update)
  File "C:\Users\Warisahmed\AppData\Roaming\Python\Python310\site-packages\pyrdf2vec\rdf2vec.py", line 163, in get_walks
    if kg.skip_verify is False and not kg.is_exist(entities):
  File "C:\Users\Warisahmed\AppData\Roaming\Python\Python310\site-packages\pyrdf2vec\graphs\kg.py", line 374, in is_exist
    responses = [self.connector.fetch(query) for query in queries]
  File "C:\Users\Warisahmed\AppData\Roaming\Python\Python310\site-packages\pyrdf2vec\graphs\kg.py", line 374, in <listcomp>
    responses = [self.connector.fetch(query) for query in queries]
  File "C:\Users\Warisahmed\AppData\Roaming\Python\Python310\site-packages\cachetools\__init__.py", line 686, in wrapper
    return c[k]
  File "C:\Users\Warisahmed\AppData\Roaming\Python\Python310\site-packages\cachetools\__init__.py", line 414, in __getitem__
    link = self.__getlink(key)
  File "C:\Users\Warisahmed\AppData\Roaming\Python\Python310\site-packages\cachetools\__init__.py", line 501, in __getlink
    value = self.__links[key]
  File "C:\Users\Warisahmed\AppData\Roaming\Python\Python310\site-packages\cachetools\keys.py", line 19, in __hash__
    self.__hashvalue = hashvalue = hash(self)
TypeError: unhashable type: 'SPARQLConnector'

Process finished with exit code 1

I hope the issue I am facing is now clear to you. I am unable to understand this issue, so can you explain what I should do, or what I am doing wrong here?

Thank You.

Improve efficiency of sampling strategies

🚀 Feature

Currently, the sampling techniques are rather slow. The depth-first-search (DFS) algorithm can potentially be improved by making use of smarter data structures and techniques such as caching.

Moreover, a very naive system is currently in place to avoid duplicate walks, but this should be improved as well: a tree should be built of the things that are already included in the walks. If all children (neighbors) of a node x are already included in the walks, then x should no longer be visited by the DFS.

Remote KG does not check whether provided entities exist.

I have a subset of facts from the FreeBase dataset in the form of triples; please take a look at the following example of my input data:

<fb:m.0100zv6s>	<fb:common.topic.notable_for>	<fb:g.1q3sj7rb3>
<fb:m.0100zv6s>	<fb:common.topic.notable_types>	<fb:m.0kpv1_>
<fb:m.0100zv6s>	<fb:music.group_member.membership>	<fb:m.0100zv6q>
<fb:m.0100zv6s>	<fb:people.person.nationality>	<fb:m.03_r3>
<fb:m.0100zv6s>	<fb:people.person.place_of_birth>	<fb:m.03_r3>

Based on the documentation, I think the KG should be initialized from scratch. Do you think so?
If so, do I have to traverse the file (around 2 GB) line by line to add walks?

error when upgrade pyrdf2vec package

hi, when I upgraded pyrdf2vec to version 2.0 and work with this version, I get the error:
Traceback (most recent call last):

import pyrdf2vec

File "C:\Users\ILINE\PycharmProjects\test2best\venv\lib\site-packages\pyrdf2vec_init_.py", line 1, in
import nest_asyncio

ModuleNotFoundError: No module named 'nest_asyncio'

meaning of the empty list at the end of the embeddings array

hi ,
I would like to ask what is meant by the empty list at the end of this output:

([array([ 0.0083779 , -0.0097333 , -0.00225981, -0.01057816, 0.01266316,
-0.00550285, 0.01651839, -0.00581943, 0.0165164 , -0.00847934,
-0.02312157, -0.00571326, 0.00177471, 0.01618937, 0.00076064,
0.00521333, 0.01502399, 0.0009116 , -0.01145697, 0.00365165,
-0.00080776, 0.02441511, -0.00221602, -0.0119072 , -0.0165608 ,
0.0101989 , -0.00139808, 0.02303413, 0.0048925 , -0.01682724,
0.00453872, 0.00916675, 0.00147707, -0.00352545, 0.00403542,
0.01361007, -0.00427693, 0.00124139, 0.0119957 , -0.01119698,
0.00649284, -0.01733533, 0.00358338, -0.00396841, 0.00736378,
0.00420047, 0.0108506 , -0.00380876, -0.01578493, -0.01183324,
-0.00355217, 0.00925911, 0.01294927, 0.01107025, 0.00544836,
0.005039 , -0.02145933, -0.01034022, -0.00858134, 0.01886703,
-0.0054844 , -0.01462951, -0.00867096, 0.00334619, -0.00305369,
-0.01331031, 0.00688533, -0.00967618, 0.01809203, 0.0203396 ,
-0.00522942, -0.02041009, -0.00425156, -0.01385427, 0.01587442,
-0.01395543, -0.00869405, -0.00107553, 0.00148736, 0.00621569,
0.00596426, -0.01485674, -0.00726925, 0.0028231 , 0.0276578 ,
0.00896207, 0.0074825 , 0.01189452, -0.00513235, -0.02138061,
-0.00895545, -0.00870362, 0.00290429, -0.00532698, 0.00769983,
0.02115051, 0.01662124, 0.01716363, 0.02318381, -0.0102157 ,
0.00689081, -0.01348563, 0.00428424, 0.0139199 , -0.01293968,
0.01977365, -0.00014994, 0.01049027, -0.01900207, 0.01211957,
0.00123155, 0.0002017 , 0.01996107, 0.0148954 , 0.01931536,
0.0134006 , -0.00888292, 0.01020645, 0.00096673, -0.00736072,
0.01215172, 0.00483533, -0.00454102, -0.00157751, -0.01042432,
-0.003007 , -0.00091393, -0.00382535, -0.00097798, -0.01381498,
-0.01454435, -0.00110119, 0.00912866, -0.00777265, -0.00877869,
-0.0049567 , 0.00672327, -0.01213343, -0.01773899, -0.02463246,
-0.00562515, 0.02292083, -0.00607371, 0.01005943, -0.01188215,
0.00353662, -0.00287847, 0.00377765, -0.00202323, -0.00775664,
-0.01295929, 0.01476607, -0.00485002, -0.006556 , -0.00177124,
-0.02351536, 0.00686879, 0.00336398, 0.00947032, 0.01771237,
0.0120958 , 0.00209924, 0.01471658, 0.00743711, -0.01009965,
-0.00039351, -0.01853843, 0.02391111, -0.00808878, -0.00091525,
-0.01269393, -0.00656295, 0.01068002, 0.00037881, 0.00955251,
0.01083717, -0.00241862, -0.00147978, -0.00854445, 0.00801867,
-0.02087955, -0.00084609, 0.00977517, 0.0114574 , 0.01317277,
0.01666669, 0.0037135 , -0.00474095, 0.01407133, -0.02554555,
0.00857792, -0.00794949, -0.01975685, 0.00501926, 0.00344069,
-0.00242801, 0.00286963, -0.02595226, -0.01242554, -0.00199386],
dtype=float32)], [])

Compilation error during install

When I try to install pyRDF2Vec on Ubuntu, I get compilation errors (see below). This happens when I run the installation with pip install pyRDF2Vec.

    numpy/random/mtrand/mtrand.c:45282:13: error: ‘PyThreadState’ {aka ‘struct _ts’} has no member named ‘exc_traceback’; did you mean ‘curexc_traceback’?
    45282 |     tstate->exc_traceback = local_tb;
          |             ^~~~~~~~~~~~~
          |             curexc_traceback
    error: Command "x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE=1 -D_LARGEFILE64_SOURCE=1 -Inumpy/core/include -Ibuild/src.linux-x86_64-3.7/numpy/core/include/numpy -Inumpy/core/src/private -Inumpy/core/src -Inumpy/core -Inumpy/core/src/npymath -Inumpy/core/src/multiarray -Inumpy/core/src/umath -Inumpy/core/src/npysort -I/usr/include/python3.7m -Ibuild/src.linux-x86_64-3.7/numpy/core/src/private -Ibuild/src.linux-x86_64-3.7/numpy/core/src/npymath -Ibuild/src.linux-x86_64-3.7/numpy/core/src/private -Ibuild/src.linux-x86_64-3.7/numpy/core/src/npymath -Ibuild/src.linux-x86_64-3.7/numpy/core/src/private -Ibuild/src.linux-x86_64-3.7/numpy/core/src/npymath -c numpy/random/mtrand/mtrand.c -o build/temp.linux-x86_64-3.7/numpy/random/mtrand/mtrand.o -MMD -MF build/temp.linux-x86_64-3.7/numpy/random/mtrand/mtrand.o.d" failed with exit status 1
    ----------------------------------------
ERROR: Command errored out with exit status 1: /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-noc198ux/numpy/setup.py'"'"'; __file__='"'"'/tmp/pip-install-noc198ux/numpy/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-fkhmshxf/install-record.txt --single-version-externally-managed --user --prefix= --compile --install-headers /home/wouter/.local/include/python3.7m/numpy Check the logs for full command output.

I also include the full output for completeness: out.log

Rename rdf2vec to pyrdf2vec

The main folder of pyRDF2Vec is named rdf2vec. To my knowledge, this has never been a problem, since the publication of this package was done manually. However, after the integration of poetry to manage our dependencies and the Python packaging more easily, this solution is no longer possible.

Indeed, poetry, configured in the pyproject.toml file (according to PEP 518), relies on the name attribute, which defines the folder (currently rdf2vec) that contains the source code of pyRDF2Vec and must also be the name of the published package.

Source: https://python-poetry.org/docs/cli/#publish

Therefore, I propose to rename the rdf2vec main folder to pyrdf2vec. Since such a change will affect all imports, I prefer to discuss this before committing it.

@GillesVandewiele should we rename this rdf2vec folder and update the imports (e.g., from rdf2vec.converters import endpoint_to_kg to from pyrdf2vec.converters import endpoint_to_kg), or continue to publish pyRDF2Vec on PyPI "by hand" for future versions of pyRDF2Vec?

Add FOSSA to check the license

Issue

To date, we do not have a third-party organization to verify the validity of the license and ensure its authenticity. Since a license is fragile and easy to mishandle, it is always a good idea to make sure it is well understood and formatted.

Solution

FOSSA helps to streamline licensing and to deal with related vulnerabilities. Its integration is simple; all we need is someone who is a member of IBCNServices to include pyRDF2Vec in a dashboard.

Additional context

As part of FOSSA, each commit will be linked to a report to ensure that the license is properly recognized and understood.

It will also be possible to add a new badge in the README to show that the license is properly formatted and understood.

Improve user-friendliness of API

When is_remote=False is passed to KG(...), we should check with the os library whether the first argument is an existing file. This avoids users mixing up is_remote=True and is_remote=False usage.

Additionally, one could check whether the provided URI is valid when is_remote=True.

pyrdf2vec: the new version

❓ Question

hi, I would like to ask whether we need to use the random.seed() method of the random module, or numpy.random.seed(), to ensure deterministic embeddings in the new version of pyrdf2vec.
Thanks

Add the Codecov Bot

Issue

To date, pyRDF2Vec does not have a Codecov bot. This results in a lack of coverage information on commits. Knowing the test coverage per commit is important to easily track the evolution of the tests.

Solution

According to the Codecov documentation (see link above), one could create an account for the Codecov bot and give it access to the repository. However, I don't have the necessary rights to do so.

Additional context

Here is what the integration of a Codecov bot looks like.
