Code Monkey home page Code Monkey logo

cggc's Introduction

Randomized Greedy (RG) modularity clustering algorithm with Core Groups Graph ensemble Clustering Iterative (CGGC/CGGCi) scheme

The Paper: An Ensemble Learning Strategy for Graph Clustering by Michael Ovelgönne and Andreas Geyer-Schulz at Graph Partitioning and Graph Clustering, Contemporary Mathematics, American Mathematical Society, 2013
Winner of modularity maximization and modularity pareto (time + quality) challenges, 10th DIMACS Implementation Challenge
Original sources: http://www.umiacs.umd.edu/~mov/

Copyright 2009-2012 Michael Ovelgönne <mov -a-t- umiacs.umd.edu>, Karlsruhe Institute of Technology.
Modified by Artem Lutov [email protected], University of Fribourg.

Licensed under the LGPL v2.1.

This is a modified version of the original rgmc (RG, CGGC_RG, CGGCi_RG clustering algorithms) extended with additional I/O formats and some minor fixes (mainly in the I/O): full support of the official Metis format of the input graph, refined parsing of the Pajek, added NSL input format and CNL output format to be applicable in the PyCABeM clustering benchmark. The core algorithm itself is untouched (though, the implementation is not optimal).

Content

Deployment

Dependencies

Boost program options library (v. >= 1.42) "sudo apt-get install libboost-program-options-dev"

Compilation

$ make [release|debug]
Compiler/Linker errors are most likely due to incorrect settings of include and lib paths.

Usage

Run rgmc with the following parameters:

$ ./rgmc -h
Performs clustering of the unweighed undirected network (graph) using RG, CGGC_RG or CGGCi_RG algorithms.

Supported Arguments:
  -h [ --help ]                   Display this message
  -i [ --inpfmt ] arg (=)        input network format (inferred from the file
                                  extension if not specified explicitly):
                                  e - nse format: header in comments + each
                                  line specifies a single edge: <sid> <did>,
                                  a - nse format: header in comments + each
                                  line specifies a single arc: <sid> <did>,
                                  m - Metis format of the unweighted network,
                                  p - Pajek format of the undirected unweighted
                                  network
  -s [ --startk ] arg (=2)        sample size of RG
  -f [ --finalk ] arg (=2000)     sample size for final RG step
  -r [ --runs ] arg (=1)          number of runs from which to pick the best
                                  result
  -e [ --ensemblesize ] arg (=-1) size of ensemble for ensemble algorithms (-1
                                  = ln(#vertices))
  -a [ --algorithm ] arg (=3)     algorithm: 1: RG, 2: CGGC_RG, 3: CGGCi_RG
  -o [ --outfmt ] arg (=l)        output clusters format:
                                  l - cnl format: header in comments + each
                                  line corresponds to the cluster and contains
                                  ids of the member vertices,
                                  c - each line corresponds to the cluster and
                                  contains ids of the member vertices,
                                  v - each line corresponds to the vertex and
                                  contains id of the owner cluster
  -f [ --outfile ] arg            file to store the detected communities
  -d [ --seed ] arg               seed value to initialize random number
                                  generator

For example:

$ rgmc --algorithm=2 --outfile=test.out test.graph
runs the CGGCi_RG algorithm on the graph test.graph and writes the results to
test.out

or just $ rgmc email.graph.

Input

The undirected unweighted input network to be clustered can be specified in the NSL (nsa/nse), Metis graph or Pajek network (using only *Edges) formats.

NSL format (nsa - arcs, directed network; nse - edges, undirected network) specifies network links in each line of the file as node ids separated by the space delimiter with optional # line comments and an optional header:

# Example Network .nse (edges - undirected)
# Nodes: 3  Edges: 3   Weighted: 0
# Note that the number of links corresponds to the number of payload lines in the file
0 1
# Empty lines and comments are allowed
0 2
2 1

Output

The CNL (clusters nodes list) output is the default and standard output format, generalization of the Stanford SNAP ground-truth communities. It is an input format for various NMI-evaluation algorithms. Each line of the file corresponds to the single resulting cluster, where member nodes are specified separated by the space/tab with optional share. For example:

# Clusters: 2, Nodes: 5, Fuzzy: 0
0
1 3 2 4

The clusters output for each vertex is also supported.

Related Projects

  • GenConvNMI - Overlapping NMI evaluation that is compatible with the original NMI (unlike the onmi).
  • OvpNMI - Another method of the NMI evaluation for the overlapping clusters (communities) that is not compatible with the standard NMI value unlike GenConvNMI, but it is much faster and yields exact results unlike probabilistic results with some variance in GenConvNMI.
  • ExecTime - A lightweight resource consumption (RSS RAM, CPU, etc.) profiler.
  • PyCABeM - Python Benchmarking Framework for the Clustering Algorithms Evaluation. Uses extrinsic (NMIs) and intrinsic (Q) measures for the clusters quality evaluation considering overlaps (nodes membership by multiple clusters).

cggc's People

Contributors

luav avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.