Hi, I'm trying to use sparse distance matrix for agglomerative hiera

ok, so I created a <a href="https://github.com/bobermayer/pynndescent/tree/save_unused

access distances in heap about pynndescent HOT 3 OPEN

bobermayer commented on May 26, 2024

access distances in heap

from pynndescent.

Comments (3)

lmcinnes commented on May 26, 2024

So the bad news is that distances calculated during the search are thrown away to save memory, so they just aren't accessible as you would want.

The possibly good news is that there are some hidden features that may do what you want. There is a module graph_utils that you can import as pynndescent .graph_utils that has added functionality to "complete" a knn graph to have a single connected component (specifically pynndescent.graph_utils.connect_graph). If that isn't quite what you need (for your clustering purposes) then the function find_component_connection_edge, which finds the shortest edge between two components in the graph (well, approximates it with an approximate search) can be modified to return a set of edges, and that might fit your needs. Are any of these useful, or do you need a more general random sampling of distances?

from pynndescent.

bobermayer commented on May 26, 2024

thanks for the reply. looking a bit more into the code, I suspected as much. I have to check whether my graphs are fully connected to see if the functionality you mention would help.
other than that, I was wondering if I could maybe increase the heap size, but I guess it's always assumed to be exactly n_neighbors and would anyway only save the smallest distances?
carrying through another bigger (unsorted?) heap could be a less efficient way of saving all those distances, not sure the overhead is worth it, because calculating them is not actually that expensive, so maybe that's what I'll try first.

from pynndescent.

bobermayer commented on May 26, 2024

ok, so I created a fork where the distances discarded in the heap_push functions are saved instead to backup arrays. this seems to work but I haven't done any detailed testing yet. I could try to implement this in a more elegant way in case you're interested.

however, I'm getting errors with numba when installing the package locally (using python setup.py install): I can run it once, but I'm getting seg faults afterwards (more specifically, I can run it multiple times in the same python session, but not in successive independent python sessions). this also happens with your master branch though.
stack traces suggest a problem in rp_trees.make_dense_tree (but unrelated to #209 as it happens with euclidean and correlation distance metrics) ~~, and it might be caused by indices = np.arange(data.shape[0]).astype(np.int32) in line 830~~

No implementation of function Function(<built-in function arange>) found for signature:
  >>> arange(int64)

thanks for any input. I have python 3.8.0, numpy 1.23.5, numba 0.53.1 (or 0.56.1 on another system, similar problem)

update: turning off the numba caching for the make_tree functions in rp_trees.py appears to resolve this issue.

another update: maybe related to caching problems for recursive functions?

from pynndescent.

access distances in heap about pynndescent HOT 3 OPEN

Comments (3)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent