bears-r-us / arkouda-njit
Home of Arachne and other Arkouda functionality provided by researchers at NJIT
License: MIT License
After meeting with Mike, we concluded that an Arkouda method-based graph construction could be simpler and more effective. A sample script of the methods we would want to use will be attached later.
Say we have a graph G with vertex names {2, 6, 7}; the internal values would be represented as {0, 1, 2}. When we run BFS on G, the returned depth array D is indexed by the internal values {0, 1, 2}. How can we let the user index into D using the original vertex names instead of the internal values?
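One possible client-side pattern (a sketch with hypothetical names, not existing Arachne API): keep the original labels alongside the internal values and translate on lookup.

```python
# Sketch: translate user-facing vertex names to internal indices before
# indexing into the BFS depth array. All names here are hypothetical.
labels = [2, 6, 7]            # original vertex names, in internal order
depths = [0, 1, 2]            # depth array D, indexed by internal values

# Build the name -> internal-index translation once.
label_to_internal = {label: i for i, label in enumerate(labels)}

def depth_of(label):
    """Return the BFS depth of a vertex given its original name."""
    return depths[label_to_internal[label]]

print(depth_of(6))  # depth of original vertex 6
```

A production version would presumably use the pdarray of original labels already stored by the graph rather than a Python dict.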
To facilitate the implementation of various algorithms and foster research in graph theory, we need access to the a priori probabilities of node labels and edge relationships in a graph. This addition would significantly enhance the efficiency and effectiveness of graph-related work, enabling users to make informed decisions and conduct thorough analyses.
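To illustrate what "a priori probabilities" could mean here, a minimal pure-Python sketch (hypothetical data, not a proposed API) that computes empirical label probabilities from a node-label list:

```python
from collections import Counter

# Hypothetical node labels; in practice these would come from the graph.
node_labels = ["person", "person", "item", "person", "item", "store"]

# Empirical prior probability of each label = count / total.
counts = Counter(node_labels)
total = len(node_labels)
priors = {label: count / total for label, count in counts.items()}

print(priors)
```

The same counting approach applies to edge relationship types.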
Calling graph_bfs raises a NameError:

NameError: name 'RCMFlag' is not defined

Looking at the code, the problem is on line 436 of graph.py: the argument passed in is named rcm_flag, but the entry "RCMFlag": RCMFlag in the args dictionary references the undefined name RCMFlag. The entry needs to be updated to use the actual parameter name.
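The bug reduces to referencing an undefined name inside the args dictionary. A minimal reproduction of the fix, with a simplified hypothetical signature (the real function lives in graph.py):

```python
def graph_bfs(graph, root, rcm_flag=0):
    # Buggy version: args = {..., "RCMFlag": RCMFlag} raises NameError,
    # because no name RCMFlag exists; the parameter is called rcm_flag.
    args = {
        "GraphName": graph,
        "Root": root,
        "RCMFlag": rcm_flag,  # fix: use the actual parameter name
    }
    return args

print(graph_bfs("g1", 0, rcm_flag=1))
```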
ar.PropGraph() behaves as a wrapper around dataframes, where new arrays for columnar data are explicitly stored in the back end. We need support for the case where a column is an Arkouda Categorical.
Self-explanatory. We want documentation that looks like this: https://bears-r-us.github.io/arkouda/ or like this: https://networkx.org/documentation/stable/reference/index.html.
When running load_edge_attributes as below:

prop_graph.load_edge_attributes(test_edge_df, source_column="src1", destination_column="dst1", relationship_columns=["data5", "data1"])

the following chain of errors is thrown:

KeyError: "Invalid column name 'src'."
KeyError: 'duplicated attribute (column) name in relationships'

This is caused by add_edge_relationships expecting columns named "src" and "dst".
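Until add_edge_relationships stops assuming those names, one workaround is to copy the source/destination columns under the expected names before loading. A sketch with a plain dict standing in for the Arkouda DataFrame (column names taken from the report above):

```python
# Stand-in for test_edge_df; in practice this is an Arkouda DataFrame.
test_edge_df = {
    "src1": [1, 2, 3],
    "dst1": [2, 3, 4],
    "data1": ["a", "b", "c"],
    "data5": [10, 20, 30],
}

# Workaround: duplicate the columns under the hard-coded names
# "src"/"dst" that add_edge_relationships currently expects.
test_edge_df["src"] = test_edge_df["src1"]
test_edge_df["dst"] = test_edge_df["dst1"]

# prop_graph.load_edge_attributes(test_edge_df, source_column="src",
#     destination_column="dst", relationship_columns=["data5", "data1"])
```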
Hi!
I'm currently using the subgraph_isomorphism function to find isomorphic subgraphs in a host graph. However, the isos array currently returned only includes vertex IDs from the host graph, without providing a mapping to the IDs of the subgraph. This makes it challenging to reconstruct the subgraph using the host graph IDs.
Here is the code illustrating the current behavior:
G = ar.PropGraph()
src = [1, 2, 3, 2]
dst = [2, 3, 4, 7]
G.add_edges_from(ak.array(src), ak.array(dst))
subgraph = ar.PropGraph()
src = [10, 20, 30, 20]
dst = [20, 30, 40, 70]
subgraph.add_edges_from(ak.array(src), ak.array(dst))
isos = ar.subgraph_isomorphism(G, subgraph)
print(isos)
The output I get is:
[1 2 3 4 7]
With help from Oliver, I created a function that maps the subgraph IDs to the host graph IDs based on the isos output, allowing me to recreate the subgraph. Here's a concise explanation of the provided code:
The function first checks if the length of the isos array is a multiple of the number of subgraph nodes. It then iterates through each isomorphic subgraph found, creating a mapping from the subgraph node IDs to the corresponding host graph node IDs. Finally, it reconstructs the subgraph edges using the host graph node IDs.
Here's the complete code:
import arkouda as ak
import arachne as ar
ak.connect()
# Define the host graph
G = ar.PropGraph()
src_host = [1, 2, 3, 2]
dst_host = [2, 3, 4, 7]
G.add_edges_from(ak.array(src_host), ak.array(dst_host))
# Define the subgraph
subgraph = ar.PropGraph()
src_sub = [10, 20, 30, 20]
dst_sub = [20, 30, 40, 70]
subgraph.add_edges_from(ak.array(src_sub), ak.array(dst_sub))
# Find isomorphic subgraphs
isos = ar.subgraph_isomorphism(G, subgraph)
print(f"Isomorphisms found: {isos}")
isos_ndarray = isos.to_ndarray() # Convert pdarray to ndarray
# Check if the length of isomorphisms is a multiple of the number of subgraph nodes
if len(isos) % len(subgraph) != 0:
    raise ValueError("The length of isomorphisms is not a multiple of the number of subgraph nodes.")

subgraph_nodes = []
host_graph_nodes = []
node_mapping = {}

number_isos_found = len(isos) / len(subgraph)

for i in range(0, int(number_isos_found)):
    # Create a mapping from subgraph nodes to host graph nodes
    subgraph_nodes = sorted(list(set(src_sub + dst_sub)))
    host_graph_nodes = isos_ndarray[i*len(subgraph_nodes):i*len(subgraph_nodes) + len(subgraph_nodes)]
    node_mapping = dict(zip(subgraph_nodes, host_graph_nodes))
    print("Mapping:", node_mapping)

# Recreate the subgraph with host graph node IDs
mapped_src_sub = [node_mapping[node] for node in src_sub]
mapped_dst_sub = [node_mapping[node] for node in dst_sub]

# Output the mapped subgraph edges
print("Mapped subgraph edges (using host graph node IDs):")
for s, d in zip(mapped_src_sub, mapped_dst_sub):
    print(f"{s} -> {d}")
The output I get is:
Isomorphisms found: [1 2 3 4 7]
Mapping: {10: 1, 20: 2, 30: 3, 40: 4, 70: 7}
Mapped subgraph edges (using host graph node IDs):
1 -> 2
2 -> 3
3 -> 4
2 -> 7
I hope this helps to implement a mapping feature directly into the library!
Request:
To improve the depth and accuracy of graph analysis, I propose adding both local and global clustering coefficients to the framework. These coefficients play a pivotal role in characterizing the structural properties of graphs.
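For reference, the local coefficient of vertex i is C_i = 2*T_i / (k_i*(k_i - 1)), where T_i counts edges among i's neighbors and k_i is its degree; a common global measure is the average of the local values. A pure-Python sketch on a toy edge list (not the proposed Arachne API):

```python
from collections import defaultdict
from itertools import combinations

# Toy undirected graph: a triangle (0, 1, 2) with a pendant vertex 3.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]

adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

def local_clustering(v):
    """C_v = 2 * (edges among v's neighbors) / (k * (k - 1))."""
    k = len(adj[v])
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))

local = {v: local_clustering(v) for v in adj}
global_cc = sum(local.values()) / len(local)
print(local, global_cc)
```

A server-side version would work over the graph's segmented arrays rather than Python sets, but the quantities computed are the same.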
There are some graph generators in arkouda-contrib/akgraph that can be ported into Arachne. They are expected to live inside of arachne/client in a file called generators.py or similar.

Most of the return statements for the generator functions are of the format:

return standardize_edges(U,V)

This can be directly mapped to Arachne functionality by doing:

graph = ar.Graph() # or DiGraph() or PropGraph()
graph.add_edges_from(U,V)

Users should be able to dictate whether they want to build a Graph(), DiGraph(), or PropGraph(). There is no need to worry about removing multiple edges or self-loops: the add_edges_from() method handles removing multiple edges, and self-loops are allowed in the base data structure.
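To sketch the porting pattern, here is a hypothetical path-graph generator that builds the (U, V) edge lists the way the akgraph generators do, with the Arachne construction step shown as comments (the ar/ak calls need a running server, so they are not executed here):

```python
def path_graph_edges(n):
    """Edge lists for a path 0 - 1 - ... - (n-1). Hypothetical generator."""
    U = list(range(n - 1))
    V = list(range(1, n))
    return U, V

U, V = path_graph_edges(5)
print(U, V)  # [0, 1, 2, 3] [1, 2, 3, 4]

# Ported to Arachne, with the graph type dictated by the user:
# graph = ar.Graph()           # or ar.DiGraph() / ar.PropGraph()
# graph.add_edges_from(ak.array(U), ak.array(V))
```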
Using the following commands to set the Chapel environment on version 1.33:
source $CHPL_HOME/util/setchplenv.bash
export CHPL_RE2=bundled
export CHPL_LLVM=bundled
export CHPL_GMP=bundled
export CHPL_COMM=none
We get the following error when building Arachne:
/scratch/users/oaa9/arkouda-njit/arachne/server/Utils.chpl:17: In function 'fastLocalSubdomain':
/scratch/users/oaa9/arkouda-njit/arachne/server/Utils.chpl:19: error: unresolved call 'unmanaged domain(1,int(64),one).locDoms[int(64)]'
$CHPL_HOME/modules/dists/BlockDist.chpl:546: note: this candidate did not match: BlockDom.locDoms
/scratch/users/oaa9/arkouda-njit/arachne/server/Utils.chpl:19: note: because call includes 1 argument
$CHPL_HOME/modules/dists/BlockDist.chpl:546: note: but function can only accept 0 arguments
/scratch/users/oaa9/arkouda-njit/arachne/server/Utils.chpl:19: note: other candidates are:
$CHPL_HOME/modules/dists/SparseBlockDist.chpl:89: note: SparseBlockDom.locDoms
/scratch/users/oaa9/arkouda-njit/arachne/server/Utils.chpl:65: called as fastLocalSubdomain(blockArray: [domain(1,int(64),one)] int(64)) from function 'generateRanges'
/scratch/users/oaa9/arkouda-njit/arachne/server/BuildGraphMsg.chpl:98: called as generateRanges(graph: shared SegGraph, key: string, key2insert: string, array: [domain(1,int(64),one)] int(64))
note: generic instantiations are underlined in the above callstack
make: *** [Makefile:359: arkouda_server] Error 1
make: Leaving directory '/scratch/users/oaa9/arkouda'
This is caused by BlockDom.locDoms in Chapel expecting 1 argument when CHPL_COMM is set and 0 arguments when CHPL_COMM is none.
The uint64 dtype does not seem to be supported; I have cast those columns to int64. This noticeably showed up on the src and dst columns for the graphs.
To streamline graph analysis workflows and facilitate easier examination of node connectivity, I suggest incorporating sorting functionality for degrees in both ascending and descending order.
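A pure-Python sketch of the requested behavior on a toy edge list (the eventual Arachne API would presumably compute degrees and sort on the server using pdarrays):

```python
from collections import Counter

# Toy undirected edge lists.
src = [1, 2, 3, 2]
dst = [2, 3, 4, 7]

# Degree of each vertex = number of edge endpoints touching it.
degree = Counter(src) + Counter(dst)

ascending = sorted(degree.items(), key=lambda kv: kv[1])
descending = sorted(degree.items(), key=lambda kv: kv[1], reverse=True)

print(ascending)
print(descending)
```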
Currently, errors in Arachne, such as using uint64 instead of int64 during file reads, crash the server. These errors should be handled properly so that the Arkouda server doesn't need to be restarted every time an error is encountered. Most of this will be handled with Chapel throws and catches (https://chapel-lang.org/docs/language/spec/error-handling.html). The reading method is just an example; all functions will need to be checked.
Currently, we require edge attributes and node attributes to be loaded separately from different dataframes via the load_edge_attributes and load_node_attributes functions in Arachne. This issue proposes that we allow node attribute columns to be specified during edge insertion, where the data given for a vertex in an edge can be stored as a node attribute.
This will improve performance, since currently a user may have to do a large merge-join on an edge attribute dataframe to get all of the data for each vertex.
Example of desired functionality from Tom:
import arkouda as ak
import arachne as akg
from glob import glob
import pandas as pd
import numpy as np
import socket
import timeit
import os
ak.connect("xxxx")
rawfilelist = ["file1","file2","file3"]
rawfilelist = rawfilelist[0:500]
columns = ["src_ip","src_port","dst_ip","dst_port","protocol"]
rawdata = ak.readmethod(rawfilelist,datasets = columns) # substitute with method to read the appropriate file types
raw_df = ak.DataFrame(rawdata)
raw_df.columns
["src_ip",
"src_port",
"dst_ip",
"dst_port",
"protocol"]
#
# TEMP FIX FOR PROPERTY GRAPH AS CODE LOOKS for "src" and "dst"
#
filtered_df = raw_df
filtered_df["src"] = filtered_df["src_ip"]
filtered_df["dst"] = filtered_df["dst_ip"]
prop_graph = akg.PropGraph()
#
# Add new collection to indicate properties to gather from src/dst nodes when they are created.
# In my case I calculated those values to use in the load_node_attributes method like the following:
#
# prop_graph.load_node_attributes(node_df,node_column="nodes",label_columns=["ip","port","protocol"]
#
# BELOW IS THE DESIRED CODE WHICH AVOIDS HAVING TO PERFORM MERGE-JOINS.
#
# NOTE: Some items left to decide. Does load_edge_attributes calculate the node_columns, or are they
# provided in the filtered_df? If there are multiple node_column values for each vertex, are they
# stored as a collection? For instance, src has two different protocols (two edges) or two ports, etc.
#
prop_graph.load_edge_attributes(filtered_df,
                                source_column="src",
                                destination_column="dst",
                                relationship_columns=["protocol","src_port","dst_port"],
                                node_columns=["ip","port","protocol"])
For some reason, the triangle counting function in Arachne returns the node input ID rather than the number of triangles found. Here's the head of the .csv dataset I used to build the graph.
**HEADER**
int64,int64,float64
*/HEADER/*
bodyId_pre,bodyId_post,weight
294437328,295470623,1
294437328,295133902,1
294437328,448260940,1
294437328,294783423,1
294437328,5812979995,1
294437328,295474441,2
294437328,265120223,1
294437328,296139882,1
Here's the code snippet and output of my triangle function call.
tri_nodes = ak.array([-1])
count = ar.triangles(graph, vertexArray=tri_nodes)
print(count)
# [-1]
@alvaradoo thinks this might be a bug. Any idea how this could be fixed?
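For comparison, the expected semantics (as I understand them) can be sketched in pure Python: for each queried vertex, count the triangles it participates in. The graph and names here are assumptions for illustration, not Arachne internals:

```python
from itertools import combinations

# Toy undirected graph: triangle (0, 1, 2) plus edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]

adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def triangles_at(v):
    """Number of triangles containing vertex v: neighbor pairs
    that are themselves connected by an edge."""
    return sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])

print([triangles_at(v) for v in (0, 2, 3)])  # [1, 1, 0]
```

With vertexArray=[-1] presumably meaning "all vertices", the returned array should hold these per-vertex counts (or a total), not echo the input IDs.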
Currently, the BFS method's performance declines as the number of locales increases. The attached branch contains an aggregated BFS method that needs to be pulled into main Arachne to handle this problem.
Tom can successfully create an edge for the property graph, but when he creates a node it consistently fails with "None Type is not iterable".
There seems to be an issue with the way read_matrix_market_file() reads in the karate.mtx file (in /arachne/data/) and instantiates an ar.DiGraph. The edges added in don't appear to be correct.