
MUSAE


The reference implementation of Multi-Scale Attributed Node Embedding. (Journal of Complex Networks 2021)

Abstract

We present network embedding algorithms that capture information about a node from the local distribution over node attributes around it, as observed over random walks following an approach similar to Skip-gram. Observations from neighborhoods of different sizes are either pooled (AE) or encoded distinctly in a multi-scale approach (MUSAE). Capturing attribute-neighborhood relationships over multiple scales is useful for a diverse range of applications, including latent feature identification across disconnected networks with similar attributes. We prove theoretically that matrices of node-feature pointwise mutual information are implicitly factorized by the embeddings. Experiments show that our algorithms are robust, computationally efficient and outperform comparable models on social, web and citation network datasets.

The second-order random walk sampling methods were taken from the reference implementation of Node2Vec.
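The return parameter (--P) and in-out parameter (--Q) listed under the random walk options below control how a second-order walk balances revisiting the previous node against moving away from it. The following is a minimal illustrative sketch of that sampling scheme for an unweighted graph, not the repository's optimized sampler; the function and variable names are ours.

import numpy as np
import networkx as nx

def second_order_walk(graph, start, walk_length, p=1.0, q=1.0):
    """Simplified Node2Vec-style second-order truncated walk."""
    walk = [start]
    while len(walk) < walk_length:
        current = walk[-1]
        neighbors = list(graph.neighbors(current))
        if not neighbors:
            break
        if len(walk) == 1:
            # No previous node yet, so step uniformly at random.
            walk.append(np.random.choice(neighbors))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbors:
            if candidate == previous:                  # return to the previous node
                weights.append(1.0 / p)
            elif graph.has_edge(candidate, previous):  # stay near the previous node
                weights.append(1.0)
            else:                                      # move further away
                weights.append(1.0 / q)
        weights = np.array(weights)
        walk.append(np.random.choice(neighbors, p=weights / weights.sum()))
    return walk

walk = second_order_walk(nx.karate_club_graph(), start=0, walk_length=10, p=1.0, q=0.5)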

The datasets are also available on SNAP.

The model is now also available in the package Karate Club.
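For quick experimentation, the Karate Club implementation can be used directly from Python. A minimal sketch of the usual fit/get_embedding workflow is shown below; the toy graph and the sparse binary feature matrix are purely illustrative, and the call signature assumes the current karateclub API.

import networkx as nx
import numpy as np
from scipy.sparse import coo_matrix
from karateclub import MUSAE

# Toy graph whose nodes are indexed 0..n-1, as Karate Club expects.
graph = nx.newman_watts_strogatz_graph(100, 10, 0.2)

# Illustrative binary node-feature matrix: 100 nodes, 200 possible features.
features = coo_matrix(np.random.binomial(1, 0.05, size=(100, 200)))

model = MUSAE()                  # default hyperparameters
model.fit(graph, features)       # attributed embedders take (graph, features)
embedding = model.get_embedding()
print(embedding.shape)           # one row of embedding values per node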

This repository provides the reference implementations for MUSAE and AE as described in the paper:

Multi-Scale Attributed Node Embedding. Benedek Rozemberczki, Carl Allen, and Rik Sarkar. Journal of Complex Networks 2021

Table of Contents

  1. Citing
  2. Requirements
  3. Datasets
  4. Logging
  5. Options
  6. Examples

Citing

If you find MUSAE useful in your research, please consider citing the following paper:

@article{musae,
  author  = {Rozemberczki, Benedek and Allen, Carl and Sarkar, Rik},
  title   = {{Multi-Scale Attributed Node Embedding}},
  journal = {Journal of Complex Networks},
  volume  = {9},
  number  = {2},
  year    = {2021},
}

Requirements

The codebase is implemented in Python 3.5.2. The package versions used for development are listed below.

networkx          2.4
tqdm              4.28.1
numpy             1.15.4
pandas            0.23.4
texttable         1.5.0
scipy             1.1.0
argparse          1.1.0
gensim            3.6.0

Datasets

Logging

The models are defined in a way that parameter settings and runtimes are logged. Specifically, we log the following:

1. Hyperparameter settings.     We save each hyperparameter used in the experiment.
2. Optimization runtime.        We measure the time needed for optimization, in seconds.
3. Sampling runtime.            We measure the time needed for sampling, in seconds.
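For example, the log written for the default dataset can be inspected directly. The small sketch below only assumes the default --log path from the options section and makes no assumption about the exact key names:

import json

with open("logs/chameleon.json") as log_file:
    log = json.load(log_file)

# The log stores the hyperparameter settings together with the sampling
# and optimization runtimes in seconds.
for key, value in log.items():
    print(key, value)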

Options

Learning the embedding is handled by the src/main.py script, which provides the following command line arguments.

Input and output options

  --graph-input      STR   Input edge list csv.     Default is `input/edges/chameleon_edges.csv`.
  --features-input   STR   Input features json.     Default is `input/features/chameleon_features.json`.
  --output           STR   Embedding output path.   Default is `output/chameleon_embedding.csv`.
  --log              STR   Log output path.         Default is `logs/chameleon.json`.
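The defaults point at an edge-list csv and a node-feature json. A hedged loading sketch follows; the exact column names of the csv and the exact json layout are assumptions based on the bundled chameleon files rather than a specification.

import json
import pandas as pd
import networkx as nx

# Edge list csv: one edge per row, assumed to contain two integer id columns.
edges = pd.read_csv("input/edges/chameleon_edges.csv")
graph = nx.from_edgelist(edges.values.tolist())

# Feature json: assumed to map each node id (as a string) to the list of
# integer feature indices present at that node.
with open("input/features/chameleon_features.json") as feature_file:
    features = json.load(feature_file)

print(graph.number_of_nodes(), "nodes")
print(len(features), "feature lists")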

Random walk options

  --sampling      STR       Random walker order (first/second).              Default is `first`.
  --P             FLOAT     Return hyperparameter for second-order walk.     Default is 1.0.
  --Q             FLOAT     In-out hyperparameter for second-order walk.     Default is 1.0.
  --walk-number   INT       Walks per source node.                           Default is 5.
  --walk-length   INT       Truncated random walk length.                    Default is 80.
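Taken together, --walk-number and --walk-length mean that every node serves as the source of several truncated walks. A first-order (uniform) sketch of that sampling loop, with hypothetical helper names, could look like this:

import random
import networkx as nx

def sample_corpus(graph, walk_number=5, walk_length=80):
    """Run walk_number uniform truncated walks from every node."""
    walks = []
    for source in graph.nodes():
        for _ in range(walk_number):
            walk = [source]
            while len(walk) < walk_length:
                neighbors = list(graph.neighbors(walk[-1]))
                if not neighbors:
                    break
                walk.append(random.choice(neighbors))
            walks.append(walk)
    return walks

corpus = sample_corpus(nx.karate_club_graph())
print(len(corpus))  # 34 nodes x 5 walks = 170 truncated walks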

Model options

  --model                 STR        Pooled or multi-scale model (AE/MUSAE).      Default is `musae`.
  --base-model            STR        Use of Doc2Vec base model.                   Default is `null`.
  --approximation-order   INT        Matrix powers approximated.                  Default is 3.
  --dimensions            INT        Number of dimensions.                        Default is 32.
  --down-sampling         FLOAT      Down-sampling rate for frequent features.    Default is 0.001.
  --exponent              FLOAT      Downsampling exponent of frequency.          Default is 0.75.
  --alpha                 FLOAT      Initial learning rate.                       Default is 0.05.
  --min-alpha             FLOAT      Final learning rate.                         Default is 0.025.
  --min-count             INT        Minimal occurrence of features.              Default is 1.
  --negative-samples      INT        Number of negative samples per node.         Default is 5.
  --workers               INT        Number of cores used for optimization.       Default is 4.
  --epochs                INT        Gradient descent epochs.                     Default is 5.
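Several of these options correspond directly to parameters of gensim's Skip-gram trainer (gensim 3.6 is listed in the requirements above). The sketch below shows how walk-derived node/feature "sentences" could be fed to Word2Vec with the matching parameter names; it mirrors, but is not, the repository's training code, and the toy sentences are purely illustrative.

from gensim.models.word2vec import Word2Vec

# Toy "sentences" mixing node identifiers with the feature tokens observed
# along each truncated walk (see the sampling sketches above).
walks = [["node_0", "feat_12", "node_3", "feat_7"],
         ["node_1", "feat_12", "node_2", "feat_5"]]

model = Word2Vec(
    walks,
    sg=1,               # Skip-gram, as in the paper
    size=32,            # --dimensions (named vector_size in gensim 4.x)
    negative=5,         # --negative-samples
    ns_exponent=0.75,   # --exponent
    sample=0.001,       # --down-sampling
    alpha=0.05,         # --alpha
    min_alpha=0.025,    # --min-alpha
    min_count=1,        # --min-count
    workers=4,          # --workers
    iter=5,             # --epochs (named epochs in gensim 4.x)
)

node_vector = model.wv["node_0"]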

Examples

Training a MUSAE model for 10 epochs.

$ python src/main.py --epochs 10

Setting the number of embedding dimensions.

$ python src/main.py --dimensions 32
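The flags documented above can also be combined; for instance, an illustrative run of the pooled AE model with second-order sampling (the parameter values here are arbitrary):

$ python src/main.py --model ae --sampling second --P 4.0 --Q 0.5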

License


Issues

What's the meaning of features?

I downloaded the datasets (GitHub) from SNAP, but I am confused about the features in .json format.
Have they already been preprocessed so that they can be used without further processing?
Or do I need to understand what each dimension of the features means?

Node features for Facebook graph

Hi, thanks for your contributions!

About the Facebook dataset - what do the node features represent, and how were they generated? The paper mentions that the features are extracted from site descriptions. Does this mean they're text features, and if so which text representation or embedding did you use?

A question on meaning of the node feature.

Thank you for your excellent work! I would be very grateful if you could answer my question: what is the meaning of the numbers in the node feature json file, for example in MUSAE/input/features/git.json? I guess that each vector in the json corresponds to a node, and the manuscript mentions that 'Node features are location, starred repositories, employer and e-mail address'. How can I turn this information into the numbers in the json file?

Thank you!

A question about node labels

Hi Benedek,

I have one question about the file "DE_target.csv". There are several files like this one in the repository.

There are several columns in this file, including "id", "days", "mature", "view", "partner", and "new_id". I am curious about which column indicates the label of a node, that is, whether a streamer uses explicit language.

Could you give me a hint about this? Many thanks!

Best regards,
Simon

reduce node feature dimension

Hi, thank you for your great work!
I have further questions.

  1. FacebookPagePage dataset

    • I want to reduce its dimensionality from 128 to 64.
    • Can I get the raw text that you used?
    • I saw your recommendation in issue #3. Can I do dimensionality reduction on this dataset, too?
  2. Twitch datasets

    • I want to reduce these too.
    • The paper mentions, "Node features are games liked, location and streaming habits."
    • So I think simple dimensionality reduction on this dataset might be harmful.
    • How can I handle these?

Thanks,
