xiangyue9607 / bionev

221 stars · 11 watchers · 76 forks · 27.68 MB

Graph Embedding Evaluation / Code and Datasets for "Graph Embedding on Biomedical Networks: Methods, Applications, and Evaluations" (Bioinformatics 2020)

License: MIT License

Python 100.00%
graph-embedding graph-embeddings-evaluation graph-embedding-methods biomedical-networks node-classification link-prediction network-embedding deepwalk node2vec line-embedding

bionev's People

Contributors

cthoyt, ddomingof, huang2960, xiangyue9607

bionev's Issues

Hyper-parameters for word2vec

Hello,

Thank you for this amazing work!
In the context of my PhD thesis, I need to run some comparisons of your work against some experimental models, and I need some hyper-parameters that I could not find in either the paper or the supplementary materials.

I wanted to ask you which hyper-parameters were used for the different graphs in the skip-gram-based models, specifically:

  1. What is the window size for the context, i.e. how near a node has to be to the central node to be considered part of its context during the embedding process? I see that the default value in your code is 10, but in small connected graphs such as the ones you considered, that would mean every node is contextual to every other node in the same connected component if the small-world hypothesis holds for these graphs.
  2. What loss function was used: an NCE-style loss or a complete softmax? If negative sampling was used, how many negative samples were drawn? (See the sketch after this list for how these settings map onto gensim parameters.)
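
For reference, in gensim-based skip-gram implementations such as the ones OpenNE builds on, these two settings correspond to the window and negative parameters of gensim's Word2Vec (hs=0 selects negative sampling rather than a full softmax). A minimal sketch with hypothetical values, assuming gensim >= 4 and pre-generated random walks; these are not necessarily the settings used in the paper:

    from gensim.models import Word2Vec

    # Hypothetical walk corpus: each random walk is a list of node IDs as strings.
    walks = [["1", "5", "3", "7"], ["2", "4", "6", "8"]]

    model = Word2Vec(
        walks,
        vector_size=128,  # embedding dimension
        window=10,        # context window around the central node (the default asked about)
        sg=1,             # skip-gram
        hs=0,             # 0 = negative sampling rather than hierarchical softmax
        negative=5,       # hypothetical number of negative samples
        min_count=0,
    )
    print(model.wv["1"])  # learned embedding for node "1"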

Thank you and have a nice day,
Luca

which pre-trained word2vec model?

Hi,

I am new to this area and I am confused about the word2vec part of node2vec. As there are different pre-trained word2vec models available, such as generic English and PubMed models, which pre-trained word2vec model was used in this experiment?

Use pre-trained model to compute embedding in test graph

Hi,

Suppose that we have two graphs, namely a training graph and a test graph. I wonder (1) how to train a node2vec (or any other method) model on the training graph and (2) how to later use this model to compute embeddings for the test graph.

The important code chunk goes as follows:

from bionev.OpenNE import node2vec
model = node2vec.Node2vec(graph=g_train, path_length=64, num_paths=32, dim=128, p=1, q=1)

Regards, Andrej
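
For reference, node2vec and the other shallow embedding methods wrapped here are transductive: they only learn vectors for nodes present in the graph they are trained on, so a test graph containing unseen nodes generally requires retraining rather than reusing a fitted model. For held-out edges among the same nodes (the link-prediction setup in BioNEV), you can train on the training graph and look the vectors up afterwards. A minimal sketch, assuming the vendored OpenNE API (models expose a vectors dict and save_embeddings()) and a hypothetical edge-list file:

    # Import paths assume the pip-installable package layout; adjust if running from src/.
    from bionev.utils import read_for_OpenNE
    from bionev.OpenNE import node2vec

    # Load the training graph and fit node2vec (training runs in the constructor).
    g_train = read_for_OpenNE('train.edgelist', weighted=False)
    model = node2vec.Node2vec(graph=g_train, path_length=64, num_paths=32,
                              dim=128, p=1, q=1)

    # Learned vectors, keyed by node ID, can be looked up or written to disk.
    embeddings = model.vectors
    model.save_embeddings('node2vec_train_embeddings.txt')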

AssertionError Problem

When I try to run the code, an AssertionError occurs with the Clin_Term_COOC data. Since I didn't change the code, I wonder what the reason could be.
"assert len(node_list) == len(embedding_look_up)"

testing_ratio

Hi, I tried to change the training/test split to 9:1, but the ratio between the training and test sets did not change. For example, with testing_ratio = 0.1 and 0.2, the ratio of links between the original network and the training network stays the same. How can I solve this? Thank you.

Original Graph: nodes: 1133 edges: 5451
Training Graph: nodes: 1133 edges: 4395

All occurrences of testing_ratio:

Searching 'testing_ratio' in E:\puo\BioNEV-master\src\*.py ...
E:\puo\BioNEV-master\src\evaluation.py: 81: def NodeClassification(embedding_look_up, node_list, labels, testing_ratio, seed):
E:\puo\BioNEV-master\src\evaluation.py: 84: testing_ratio=testing_ratio,seed=seed)
E:\puo\BioNEV-master\src\main.py: 30: parser.add_argument('--testing_ratio', default=0.1, type=float,
E:\puo\BioNEV-master\src\utils.py: 52: def split_train_test_graph(input_edgelist, seed, testing_ratio=0.1, weighted=False):
E:\puo\BioNEV-master\src\utils.py: 60: testing_edges_num = int(len(G.edges) * testing_ratio)
E:\puo\BioNEV-master\src\utils.py: 151: def split_train_test_classify(embedding_look_up, X, Y, seed, testing_ratio=0.1):
E:\puo\BioNEV-master\src\utils.py: 153: training_ratio = 1 - testing_ratio
Hits found: 7
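
Based on the argparse line above, the split is controlled by the --testing_ratio option of main.py and defaults to 0.1, so it only changes when passed explicitly on the command line. A hypothetical invocation (paths are placeholders):

    python main.py --input ./data/your_network.edgelist --output ./embeddings/node2vec.txt --method node2vec --task link-prediction --testing_ratio 0.2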

node2vec is killed in evaluation phase

Hi,

First thanks for a great paper. However, when I try to reproduce your results, the node2vec method is suddenly killed in the evaluation procedure.

I used the following line to start with node2vec embedding:
bionev --input ./data/Clin_Term_COOC/Clin_Term_COOC.edgelist --output ./embeddings/node2vec.txt --method node2vec --task link-prediction --eval-result-file eval_results2.txt --weighted True

The output lines:

######################################################################
Embedding Method: node2vec, Evaluation Task: link-prediction
######################################################################
Original Graph: nodes: 48651 edges: 1659249
Training Graph: nodes: 48651 edges: 1328307
Loading training graph for learning embedding...
Graph Loaded...
Preprocess transition probs...
Begin random walk...
Walk finished...
Learning representation...
Saving embeddings...
Embedding Learning Time: 10034.56 s
Nodes with embedding: 48651
Begin evaluation...
Killed

I have 256 GB of memory on my server, so I suspect that RAM is not the issue. When I tried with a smaller dataset, the evaluation phase ended successfully.

Any idea what to do?

Best, Andrej

Include models from PyKEEN

There are several KGE models implemented in https://github.com/smartDataAnalytics/PyKEEN from @mali-git that cover translational distance models (e.g., TransE, TransH, TransR, TransD, UM, SE) and semantic matching models (e.g., RESCAL, DistMult, ERMLP, ConvE) that weren't mentioned in the README of this repo. I'm also aware that some, but not all, of these models have been made available through the packages that you've already integrated.

PyKEEN was developed with a specific focus on reusability, so I hope that we can either make a PR adding a wrapper so it works the same way as your models, or that you might become interested and make use of it yourselves.

Make code pip installable

It looks like the code is organized such that it could be made pip-installable. This would make dependency management and reusability a lot better. I will submit a PR :)
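
For illustration only, a minimal setup.py sketch assuming a src/ layout with a bionev package and a bionev console script; the package name is taken from the repository, but the version, dependency list, and entry-point module are hypothetical and not necessarily what the eventual PR used:

    from setuptools import setup, find_packages

    setup(
        name='bionev',
        version='0.1.0',                      # hypothetical version
        packages=find_packages(where='src'),  # assumes packages live under src/
        package_dir={'': 'src'},
        install_requires=[
            'networkx', 'numpy', 'scipy', 'gensim', 'scikit-learn',  # illustrative dependency list
        ],
        entry_points={
            # hypothetical entry point; the actual module/function may differ
            'console_scripts': ['bionev = bionev.main:more_main'],
        },
    )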

ValueError problem

When I run the code, a ValueError occurs with the Clin_Term_COOC data:
"multilabel-indicator is not supported"

ValueError: not enough values to unpack (expected 3, got 2)

I get an error when running SDNE and DeepWalk (OpenNE), as follows:

Original Graph: nodes: 1133 edges: 5451
Training Graph: nodes: 1133 edges: 3874
Loading training graph for learning embedding...
Traceback (most recent call last):
  File "main.py", line 198, in <module>
    more_main()
  File "main.py", line 194, in more_main
    main(parse_args())
  File "main.py", line 126, in main
    embedding_training(args, train_graph_filename)
  File "F:\BioNEV-master\src\embed_train.py", line 25, in embedding_training
    g = read_for_OpenNE(train_graph_filename, weighted=args.weighted)
  File "F:\BioNEV-master\src\utils.py", line 17, in read_for_OpenNE
    G.read_edgelist(filename=filename, weighted=weighted)
  File "F:\BioNEV-master\src\OpenNE\graph.py", line 79, in read_edgelist
    func(l)
  File "F:\BioNEV-master\src\OpenNE\graph.py", line 65, in read_weighted
    src, dst, w = l.split()
ValueError: not enough values to unpack (expected 3, got 2)

How can I deal with this? Thanks.
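
The traceback points at OpenNE's weighted edge-list reader, which expects three whitespace-separated fields per line (source, target, weight) but found only two; this typically happens when --weighted True is used with an edge list that has no weight column. An illustrative check of which format a file actually has (the file name is hypothetical):

    # Collect the number of fields per line: {2} means "src dst" (run without --weighted),
    # {3} means "src dst weight" (run with --weighted True).
    with open('train_graph.edgelist') as f:
        field_counts = {len(line.split()) for line in f if line.strip()}
    print(field_counts)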

struc2vec

When I ran struc2vec, I encountered the following problem:
######################################################################
Original Graph: nodes: 1133 edges: 5451
Training Graph: nodes: 1133 edges: 4923
Loading training graph for learning embedding...
Graph Loaded...
Traceback (most recent call last):
  File "main.py", line 197, in <module>
    more_main()
  File "main.py", line 193, in more_main
    main(parse_args())
  File "main.py", line 125, in main
    embedding_training(args, train_graph_filename)
  File "E:\puo\BioNEV-master\src\bionev\embed_train.py", line 27, in embedding_training
    _embedding_training(args, G=g)
  File "E:\puo\BioNEV-master\src\bionev\embed_train.py", line 37, in _embedding_training
    format='%(asctime)s %(message)s')
  File "E:\puo\lib\logging\__init__.py", line 1808, in basicConfig
    h = FileHandler(filename, mode)
  File "E:\puo\lib\logging\__init__.py", line 1032, in __init__
    StreamHandler.__init__(self, self._open())
  File "E:\puo\lib\logging\__init__.py", line 1061, in _open
    return open(self.baseFilename, self.mode, encoding=self.encoding)
FileNotFoundError: [Errno 2] No such file or directory: 'E:\puo\BioNEV-master\src\src\bionev\struc2vec\struc2vec.log'

How can I solve it? Thank you
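
The FileNotFoundError comes from logging.basicConfig trying to open a FileHandler for struc2vec.log inside a directory that does not exist; the doubled src\src in the path suggests the relative log path does not match the working directory the script was launched from. An illustrative workaround, using the path from the traceback (adjust to your checkout), is to create the directory before training:

    import os

    # Path copied from the traceback; hypothetical for any other setup.
    log_dir = r'E:\puo\BioNEV-master\src\src\bionev\struc2vec'
    os.makedirs(log_dir, exist_ok=True)  # ensure the directory exists before logging opens the file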

Reproduction question: how were the datasets generated?

This repository includes five nice datasets, which make reproduction much easier, but it's missing whatever scripts (or instructions, if done manually) were used to generate them. These would be helpful not only to ensure the correctness of the training data, but also to enable periodic reproduction as the underlying databases are updated.

STRING_PPI

Dear Xiang Yue,
Could you please tell me which protein-protein interaction data file from the STRING website you used to generate the edgelist? Is it 9606.protein.links.v10.5.txt.gz?
Thank you!

edge duplication

Hi @xiangyue9607 Thanks for releasing these datasets.

Recently I tried to use Mashup_PPI for some experiments and found that there are many duplicate edges in the edge list. Perhaps it needs to be cleaned.
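
An illustrative way to count duplicate edges in an edge-list file; the file name is hypothetical, and sorting the endpoints assumes the graph is undirected so that "a b" and "b a" count as the same edge:

    from collections import Counter

    # Normalize each edge to a sorted (u, v) pair, ignoring any weight column.
    with open('Mashup_PPI.edgelist') as f:
        edges = [tuple(sorted(line.split()[:2])) for line in f if line.strip()]

    duplicates = {e: n for e, n in Counter(edges).items() if n > 1}
    print(len(duplicates), 'distinct edges appear more than once')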
