Comments (3)
So the bad news is that distances calculated during the search are thrown away to save memory, so they just aren't accessible as you would want.
The possibly good news is that there are some hidden features that may do what you want. There is a module, `graph_utils`, that you can import as `pynndescent.graph_utils`, which has added functionality to "complete" a knn graph so that it has a single connected component (specifically `pynndescent.graph_utils.connect_graph`). If that isn't quite what you need (for your clustering purposes), then the function `find_component_connection_edge`, which finds the shortest edge between two components in the graph (well, approximates it with an approximate search), can be modified to return a set of edges, and that might fit your needs. Are any of these useful, or do you need a more general random sampling of distances?
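To make the "set of edges between two components" idea concrete, here is a rough brute-force sketch. It is not how `find_component_connection_edge` works (that function uses an approximate search, and its signature is not shown here); the function name `k_shortest_cross_edges`, the `labels` array of component ids, and the Euclidean metric are all assumptions made for illustration.

```python
import numpy as np


def k_shortest_cross_edges(data, labels, comp_a, comp_b, k=3):
    """Return the k shortest (i, j, distance) edges between two components.

    Hypothetical helper: exact brute force over all cross-component pairs,
    so it is only practical for small components.
    """
    idx_a = np.flatnonzero(labels == comp_a)
    idx_b = np.flatnonzero(labels == comp_b)
    # Pairwise Euclidean distances, shape (len(idx_a), len(idx_b)).
    diffs = data[idx_a][:, None, :] - data[idx_b][None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    k = min(k, dists.size)
    # Indices of the k smallest entries of the flattened distance matrix.
    flat = np.argsort(dists, axis=None)[:k]
    rows, cols = np.unravel_index(flat, dists.shape)
    return [
        (int(idx_a[r]), int(idx_b[c]), float(dists[r, c]))
        for r, c in zip(rows, cols)
    ]


# Toy data: two components on a line, {0, 1} and {2, 3}.
data = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [6.5, 0.0]])
labels = np.array([0, 0, 1, 1])
edges = k_shortest_cross_edges(data, labels, comp_a=0, comp_b=1, k=2)
```

For large components the full pairwise matrix would not fit in memory, which is presumably why the library approximates this step with a search instead.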
Thanks for the reply. Looking a bit more into the code, I suspected as much. I have to check whether my graphs are fully connected to see if the functionality you mention would help.
Other than that, I was wondering if I could maybe increase the heap size, but I guess it's always assumed to be exactly `n_neighbors`, and it would only keep the smallest distances anyway?
Carrying along another, bigger (unsorted?) heap could be a less efficient way of saving all those distances. I'm not sure the overhead is worth it, because calculating them is not actually that expensive, so maybe that's what I'll try first.
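For reference, the "just calculate them" option could look roughly like the sketch below: draw random index pairs and compute their distances directly, outside of the index. The helper name `sample_pair_distances` and the Euclidean metric are assumptions for illustration; nothing here touches pynndescent itself.

```python
import numpy as np


def sample_pair_distances(data, n_pairs, seed=0):
    """Sample random pairs (i, j) with i != j and return their distances.

    Hypothetical helper; pairs are drawn with replacement, so the same
    pair can appear more than once.
    """
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    i = rng.integers(0, n, size=n_pairs)
    # Adding an offset in [1, n - 1] modulo n guarantees j != i.
    j = (i + rng.integers(1, n, size=n_pairs)) % n
    dists = np.linalg.norm(data[i] - data[j], axis=1)
    return i, j, dists


data = np.random.default_rng(42).normal(size=(100, 8))
i, j, dists = sample_pair_distances(data, n_pairs=500)
```

Unlike the distances seen during the search, a sample like this is not biased towards near neighbours, which may or may not be what the clustering step needs.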
OK, so I created a fork where the distances discarded in the `heap_push` functions are saved to backup arrays instead. This seems to work, but I haven't done any detailed testing yet. I could try to implement this in a more elegant way in case you're interested.
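The idea, stripped down to plain Python, is roughly the following. This is only a sketch of the concept, not the code in the fork and not pynndescent's numba-based heap: the class `NeighborHeapWithBackup` is hypothetical, it uses `heapq` instead of preallocated arrays, and it does not check for duplicate indices.

```python
import heapq


class NeighborHeapWithBackup:
    """Bounded heap of the n_neighbors smallest distances for one point.

    Hypothetical sketch: anything that is rejected or evicted is appended
    to `discarded` instead of being thrown away.
    """

    def __init__(self, n_neighbors):
        self.n_neighbors = n_neighbors
        self._heap = []      # max-heap via negated distances: (-dist, index)
        self.discarded = []  # (index, dist) pairs that did not stay in the heap

    def push(self, index, dist):
        if len(self._heap) < self.n_neighbors:
            heapq.heappush(self._heap, (-dist, index))
            return True
        worst_dist = -self._heap[0][0]
        if dist >= worst_dist:
            # Not better than the current worst neighbour: keep it as backup.
            self.discarded.append((index, dist))
            return False
        # Better than the current worst: swap it in, back up the evicted one.
        neg_dist, evicted = heapq.heapreplace(self._heap, (-dist, index))
        self.discarded.append((evicted, -neg_dist))
        return True

    def neighbors(self):
        """Return the kept (dist, index) pairs, nearest first."""
        return sorted((-neg_dist, idx) for neg_dist, idx in self._heap)


heap = NeighborHeapWithBackup(n_neighbors=2)
for idx, d in [(0, 3.0), (1, 1.0), (2, 2.0), (3, 5.0)]:
    heap.push(idx, d)
```

The backup list grows with every rejected candidate, so memory use is no longer bounded by `n_neighbors`; that is the trade-off against the original design, which discards distances to save memory.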
However, I'm getting errors with numba when installing the package locally (using `python setup.py install`): I can run it once, but I get segfaults afterwards (more specifically, I can run it multiple times in the same Python session, but not in successive independent Python sessions). This also happens with your `master` branch, though.
Stack traces suggest a problem in `rp_trees.make_dense_tree` (but unrelated to #209, as it happens with both the euclidean and correlation distance metrics), and it might be caused by `indices = np.arange(data.shape[0]).astype(np.int32)` in line 830:
No implementation of function Function(<built-in function arange>) found for signature:
>>> arange(int64)
Thanks for any input. I have Python 3.8.0, numpy 1.23.5, and numba 0.53.1 (or 0.56.1 on another system, with a similar problem).
Update: turning off the numba caching for the `make_tree` functions in `rp_trees.py` appears to resolve this issue.
Another update: maybe this is related to caching problems for recursive functions?