Comments (19)

Dru-Mara commented on August 22, 2024

Hi,

We define precision@k as the fraction of the k node pairs with the highest similarity that are actual edges (i,j) \in E. For NR, we basically: 1) take all node pairs in the network, 2) compute their probability of being linked, 3) sort them in descending order, and 4) look at the first k predictions and count how many of those are actual edges in the input graph. That count divided by k is reported as the precision at k.
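
For illustration, here is a minimal stand-alone sketch of that computation (my own toy code, not the library's implementation; scores is assumed to hold one predicted link probability per candidate pair):

    import numpy as np

    def precision_at_k(scores, labels, k):
        # scores: predicted link probabilities, one per node pair
        # labels: 1 if the pair is an actual edge of the input graph, else 0
        order = np.argsort(scores)[::-1]               # rank pairs by score, descending
        return np.asarray(labels)[order[:k]].sum() / k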

For very large graphs step 1 is infeasible (because we would have to compute almost n*n similarities). Therefore, we generally take a random sample of node pairs (e.g. 10% of them) and compute an "approximated" precision@k.

I hope this helps,
Alex

Dru-Mara commented on August 22, 2024

Hi,

LP and NR are in some sense very similar tasks, since they both boil down to a binary classification problem. However, there are two fundamental differences: 1) in the usual NR evaluation pipeline we need to make many more predictions than for LP, and 2) in NR the complete graph is used for training, validation and testing. More about each task:

Link Prediction
When evaluating a method for LP you want to see how well it can distinguish between pairs of nodes that are not currently connected in the input graph but should be, and those that should not be connected. To evaluate LP we generally split a network into a set of train edges and a set of test edges. We also add equal numbers of non-connected node pairs (non-edges) for training and testing and run the binary classification task. The results can then be measured using, for example, AUC. So if the test split contains e.g. 1000 edges, we also sample 1000 test non-edges and need to predict the probability of linking for 2000 pairs.
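
For intuition, here is a bare-bones, self-contained version of this evaluation (a sketch independent of EvalNE; random vectors stand in for real edge embeddings):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    # Toy edge embeddings: equal numbers of edges (label 1) and non-edges
    # (label 0) in both the train and the test split.
    X_train, y_train = rng.normal(size=(2000, 16)), np.repeat([1, 0], 1000)
    X_test, y_test = rng.normal(size=(2000, 16)), np.repeat([1, 0], 1000)

    clf = LogisticRegression().fit(X_train, y_train)
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])  # ~0.5 on random data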

Network reconstruction
In NR, on the other hand, we want to see how well an embedding method has captured the structure of the input network. This is equivalent to seeing if we can go back from the embeddings to the original adjacency matrix of the input graph. In order to do that, we would need to predict, for every possible pair of nodes, the probability that they are linked. The better the method works, the closer this probability matrix will be to the original adjacency matrix of the graph. So if you have a graph with N=1000 nodes, we will need to compute predictions for N^2 possible pairs (1000000 predictions).
As this is infeasible for large graphs, we generally take a random subset of all possible node pairs. Let's say we take 1% of those 1000000 possible pairs; that's 10000 predictions we still need to make. Out of those, say about 500 are real edges and 9500 are non-edges. AUC in this case (unlike for LP) doesn't make sense, because there is a huge class imbalance (far more non-edges than true edges), so we use a different metric, usually precision@k. To compute it we sort all those 10000 predictions in descending order, take the first k and count how many of those are actual edges. A perfect method should give you a precision@k equal to 1 for any k <= 500.
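
As a toy check of that last claim (again my own sketch, not EvalNE code):

    import numpy as np

    labels = np.array([1] * 500 + [0] * 9500)  # 500 edges among 10000 sampled pairs
    scores = labels.astype(float)              # a perfect method ranks all edges first
    order = np.argsort(scores)[::-1]
    for k in (100, 500, 1000):
        print(k, labels[order[:k]].sum() / k)  # -> 1.0, 1.0, 0.5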

In EvalNE
In the library, NR is evaluated in the following steps:

  1. Relabel the input graph nodes to sequential integers in 0...N-1

  2. Select a random sample of node pairs from the graph, e.g. 1%:
    pos_e, neg_e = stt.random_edge_sample(nx.adj_matrix(G), 0.01, nx.is_directed(G))

  3. Initialize a traintest_split with those sets of edges and non-edges. Also initialize the train graph to be the complete graph G:
    traintest_split.set_splits(train_E=pos_e, train_E_false=neg_e, test_E=None, test_E_false=None, directed=nx.is_directed(G), nw_name='test_network', TG=G)

  4. Initialize the NREvaluator. No validation split is needed, since hyper-parameter tuning is done directly on the input graph:
    nee = NREvaluator(traintest_split, embed_dim, lp_model)
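
Putting the four steps together, a rough end-to-end sketch could look as follows (the imports and the prep_graph call reflect my understanding of the EvalNE API and may differ between versions; check the documentation):

    import networkx as nx
    from sklearn.linear_model import LogisticRegression
    from evalne.evaluation.evaluator import NREvaluator
    from evalne.evaluation.split import EvalSplit
    from evalne.utils import preprocess as pp
    from evalne.utils import split_train_test as stt

    # Toy input graph; step 1: relabel nodes to sequential integers
    G = nx.fast_gnp_random_graph(1000, 0.01, seed=42)
    G, _ = pp.prep_graph(G, relabel=True, del_self_loops=True)

    # Step 2: sample e.g. 1% of all node pairs
    pos_e, neg_e = stt.random_edge_sample(nx.adj_matrix(G), 0.01, nx.is_directed(G))

    # Step 3: everything is a train split; the train graph TG is G itself
    traintest_split = EvalSplit()
    traintest_split.set_splits(train_E=pos_e, train_E_false=neg_e, test_E=None,
                               test_E_false=None, directed=nx.is_directed(G),
                               nw_name='test_network', TG=G)

    # Step 4: no validation split is needed for NR
    embed_dim = 128
    lp_model = LogisticRegression()
    nee = NREvaluator(traintest_split, embed_dim, lp_model)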

Note: When using the library as an API, the user can decide to evaluate NR similarly to LP and obtain AUC scores. In that case, all graph edges should be considered train edges and the number of test edges should be 0. The same number of train non-edges as train edges has to be generated so that the AUC score makes sense. The only difference w.r.t. the LP evaluation would then be that hyper-parameter tuning is done on the complete graph and the final reported scores are the train scores.
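
A sketch of that AUC-style setup, reusing G and traintest_split from the steps above (the naive rejection-sampling loop for non-edges is my own illustration, not a library helper, and assumes an undirected graph):

    import random
    import networkx as nx

    train_E = list(G.edges())                    # all graph edges are train edges
    seen = set(map(frozenset, train_E))
    nodes = list(G.nodes())
    train_E_false = []
    while len(train_E_false) < len(train_E):     # as many non-edges as edges
        u, v = random.sample(nodes, 2)
        if frozenset((u, v)) not in seen:
            train_E_false.append((u, v))
            seen.add(frozenset((u, v)))

    traintest_split.set_splits(train_E=train_E, train_E_false=train_E_false,
                               test_E=None, test_E_false=None,
                               directed=nx.is_directed(G),
                               nw_name='test_network', TG=G)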

LPioL commented on August 22, 2024

Thanks a lot for this answer, it helped a lot. What is the next step to evaluate the embedding? Can I use nee.evaluate_ne?
Indeed, I have my own embedding method, as I mentioned in a previous issue, and it is this method I am using for this problem.

Dru-Mara commented on August 22, 2024

Yes, you can indeed use the nee.evaluate_ne() method. In fact, all methods from the LPEvaluator are inherited by the NREvaluator, so you don't have to worry about anything else :)
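
A minimal call might look like this (hedged: the keyword names below reflect my reading of the EvalNE docs and may differ between versions; X is assumed to be a dict mapping node ids to embedding vectors produced by your own method):

    results = nee.evaluate_ne(data_split=nee.traintest_split, X=X,
                              method='my_method', edge_embed_method='average')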

LPioL commented on August 22, 2024

Re-reading your answer, could you clarify how the train and test phases are done for network reconstruction? Since we have to predict on all the true and false edges of the network, how do you train the logistic learner?

Dru-Mara commented on August 22, 2024

Hi,
There is no difference in how the LR classifier is trained between LP and NR. The only difference is in the numbers of edges and non-edges used for training. For LP we select equal amounts of both, while for NR we select either all edges and non-edges or a subset of them (using the random_edge_sample() function I mentioned earlier). For NR we also don't need test edges, since we are only interested in the train scores.

So a very simplified LP pipeline would be:
Network -> split tr/te edges -> generate non-edges -> compute NE using the graph spanned by tr edges
-> compute EE for tr/te -> train LR with tr EE and tr labels -> use LR to predict labels of te EE -> compute test score

For NR we have:
Network -> sample x% of all possible node pairs (you will have more non-edges than edges) -> compute NE using the graph spanned by the true edges in your sample -> compute EE for tr (all the pairs you sampled) -> train LR with tr EE and tr labels -> use LR to predict labels of tr EE -> compute train score

Here NE denotes node embeddings and EE edge embeddings. Also, the library does most of this for you. You only need to follow the steps in the previous answer: 1) compute the random edge sample, 2) initialize the traintest_split object with those samples and set TG=G, 3) initialize the evaluator, 4) use traintest_split.train_edges to train your method and obtain node embeddings, and 5) call evaluate_ne() with the embeddings you got :)
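
To make the pipeline concrete outside the library, here is a self-contained toy version (random vectors stand in for real node embeddings, and Hadamard products serve as edge embeddings):

    import numpy as np
    import networkx as nx
    from itertools import combinations
    from sklearn.linear_model import LogisticRegression

    G = nx.fast_gnp_random_graph(200, 0.05, seed=0)
    NE = np.random.default_rng(0).normal(size=(200, 16))  # stand-in node embeddings

    pairs = list(combinations(G.nodes(), 2))           # all pairs; fine for small N
    y = np.array([int(G.has_edge(u, v)) for u, v in pairs])
    EE = np.array([NE[u] * NE[v] for u, v in pairs])   # Hadamard edge embeddings

    lr = LogisticRegression(max_iter=1000).fit(EE, y)  # train LR on the sampled pairs
    scores = lr.predict_proba(EE)[:, 1]                # predict on those same pairs
    top = np.argsort(scores)[::-1][:100]
    print('precision@100:', y[top].mean())             # train-side precision@k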

LPioL commented on August 22, 2024

Thank you for the clarification. Would you have a link to an article on which you based your approach?

LPioL commented on August 22, 2024

Hello,
How do you define the precision@K?

Best.

Dru-Mara commented on August 22, 2024

Sorry, I misunderstood the question.

When using EvalNE as a command-line tool, you can set the value of k in the conf file via the variable PRECATK_VALS, e.g.:

  • PRECATK_VALS = 1 2 10 100 200 500 800 1000 10000 100000

When using EvalNE as an API, you can set the k values for which to compute the precision when creating a Scoresheet object, e.g.:

  • scoresheet = Scoresheet(tr_te='train', precatk_vals=[1,2,10,100,1000])

If you are not using the Scoresheet, the Results objects the library returns after evaluating each method also have a precisionatk(k=100) method which takes k as a parameter.
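
For example (log_results and write_all are from my memory of the Scoresheet API; please verify them in the docs):

    scoresheet = Scoresheet(tr_te='train', precatk_vals=[1, 2, 10, 100, 1000])
    scoresheet.log_results(results)        # results returned by nee.evaluate_ne(...)
    scoresheet.write_all('nr_output.txt')  # writes all logged metrics, incl. precision@k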

If you are evaluating NR and want to compute the approximated precision@k on a random sample of all node pairs in the graph, you can set that proportion in the variable NR_EDGE_SAMP_FRAC, e.g.:

  • NR_EDGE_SAMP_FRAC = 0.1 (this corresponds to 10% of all node pairs in the graph)

Regards,
Alex

Dru-Mara commented on August 22, 2024

Hi,

I just realized that the documentation of the Results object is not complete; I'm sorry about that. I'll try to improve it as soon as I have some time. In the meantime, here is an answer to your question:

The nee.evaluate_ne() function returns a Results object. This object contains two attributes, the train and the test Scores of the methods you have evaluated, so you can call:
train_scores = tmp_result_node2vec.train_scores
test_scores = tmp_result_node2vec.test_scores

Since you are evaluating NR, you are only interested in the train scores. The Scores class offers a set of methods described in the API documentation. Amongst these methods we have auroc(), accuracy(), precision() and precisionatk(k=100). You are interested in precision at k, so you can call:

precAt1 = tmp_result_node2vec.train_scores.precisionatk(k=1)
precAt100 = tmp_result_node2vec.train_scores.precisionatk(k=100)
precAt10000 = tmp_result_node2vec.train_scores.precisionatk(k=10000)

If you want to report NR performance in terms of AUROC or F-score, you can call:
auroc = tmp_result_node2vec.train_scores.auroc()
f_score = tmp_result_node2vec.train_scores.f_score(beta=1)

I hope this clarifies things.
Alex

LPioL commented on August 22, 2024

Hi,

Just another question. I realized that when samp_frac = 100, the number of positive edges pos_e is smaller than the actual number of true edges in the graph.
I don't understand why. If you can explain, that would be great!

I used the code you gave me:
pos_e, neg_e = stt.random_edge_sample(nx.adj_matrix(G), samp_frac=100,
                                      directed=nx.is_directed(G))
traintest_split.set_splits(train_E=pos_e, train_E_false=neg_e, test_E=None,
                           test_E_false=None, directed=nx.is_directed(G),
                           nw_name='test_network', TG=G)
nee = NREvaluator(traintest_split, embed_dim, lp_model)

Thanks!

Dru-Mara commented on August 22, 2024

Hi,

Thanks for pointing that out! The behaviour of the random_edge_sample function is indeed incorrect for large values of samp_frac. This function was designed to be as efficient as possible in generating random node pairs, which meant allowing repetitions in the generation process. Thus, as samp_frac approaches 1, many candidate pairs will be generated several times, meaning that the final number of distinct node pairs will be lower than expected.
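
You can see the effect with a quick toy simulation (my own illustration, not the library's code): drawing n candidates with replacement from n possibilities yields about 1 - (1 - 1/n)^n, i.e. roughly 63%, distinct values.

    import numpy as np

    n = 1_000_000
    sample = np.random.default_rng(0).integers(0, n, size=n)  # n draws with replacement
    print(len(np.unique(sample)) / n)  # ~0.632, so ~37% of the draws are repeats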

For sampling proportions close to the total number of possible node pairs (or all possible pairs), you should use the random_edge_sample_other() function. It takes the same arguments (with samp_frac in (0,1]) and should return the expected number of candidates. Please mind that with this function you will need to compute the transpose of the negative edges. So your example would look like:

pos_e, neg_e = stt.random_edge_sample_other(nx.adj_matrix(G), samp_frac=1,
                                            directed=nx.is_directed(G))
neg_e = neg_e.T
traintest_split.set_splits(train_E=pos_e, train_E_false=neg_e, test_E=None,
                           test_E_false=None, directed=nx.is_directed(G),
                           nw_name='test_network', TG=G)
nee = NREvaluator(traintest_split, embed_dim, lp_model)

We will patch this in an upcoming version of the library. In the meantime, I hope this helps.
Alex

Dru-Mara commented on August 22, 2024

Hi Léo,

You can always use the random_edge_sample_other() function by default. You should have no issues unless you try to sample 100% of the node pairs from an adjacency matrix that is, let's say, 10,000 by 10,000. In that case, you might run out of memory with either function.

The random_edge_sample() function is just more efficient and works especially well in cases where you want to sample less than 50% of the node pairs. (In the literature, most authors sample between 1% and 10% of node pairs for network reconstruction, depending on the network size.)

Regards,
Alex
