falconn-lib / falconn Goto Github PK
View Code? Open in Web Editor NEWFAst Lookups of Cosine and Other Nearest Neighbors (based on fast locality-sensitive hashing)
Home Page: http://falconn-lib.org/
License: MIT License
FAst Lookups of Cosine and Other Nearest Neighbors (based on fast locality-sensitive hashing)
Home Page: http://falconn-lib.org/
License: MIT License
The readme states: "Despite being designed for the cosine similarity, FALCONN can often be used for nearest neighbor search under the Euclidean distance or a maximum inner product search."
So how would one go about using the maximum inner product search?
Now linker fails because of not inlined functon in src/test/test_utils.h.
The fix is trivial:
--- a/src/test/test_utils.h
+++ b/src/test/test_utils.h
@@ -11,7 +11,7 @@
namespace falconn {
namespace test {
-void compare_vectors(const std::vector<float>& expected,
+inline void compare_vectors(const std::vector<float>& expected,
const std::vector<float>& result, float eps) {
ASSERT_EQ(expected.size(), result.size());
for (size_t ii = 0; ii < expected.size(); ++ii) {
Make sure our code mostly follows https://google.github.io/styleguide/cppguide.html and use their lint tool.
Ideally, we'd have a clear subset of the Google rules we adopt and a corresponding lint script (based on cpplint.py).
On my machine building the Python package consumes more than 2 Gb of RAM, I guess because of heavily relying on C++ templates.
If there something we can do about it?
It might make sense to wait with this until Cxx.jl (https://github.com/Keno/Cxx.jl) is supported by a default Julia installation (maybe in 0.5?). Otherwise, we might need to write a C wrapper around FALCONN (which might actually be useful in other cases, too).
We should keep track of people who report bugs, help with testing, etc.!
So far we have:
The build targets for all the tests should be generated fully automatically by a script. The list of headers can also be generated by a script.
Seeing as you're planning a future Julia wrapper, I thought I'd ask for an R one too, so I can try FALCONN instead of RcppAnnoy.
The main point of this release is to upload a package to PyPi built with a corrected version of swig
(see #58 for the discussion). However, it would be good to address the following minor issues:
clang
and swig
, see #58)As far as I understand, the underlying algorithm doesn't require storing the dataset in memory, only the generated hash tables.
Given the favorable computational complexity, falconn is very interesting to use with large datasets, but then storing whole datasets in memory becomes either infeasible or very expensive.
All in all, I'm not sure where the requirement of storing the whole dataset comes from โ is it an inherent limtation of the method or was it just not implemented yet (and if so, are there plans to?)
I tried to use find_k_nearest_neighbors function in multi-thread, but got the "double free or corruption" error.
So do we have the multi-thread safe version of LSHTable?
LSH computations can easily be spread across multiple cores.
Currently, it is done by hand via clang-format --style=Google -i <file-name>
for C++ and yapf -i <file-name>
for Python.
pybind11 is a fairly new project for accessing C++ code from Python:
https://github.com/pybind/pybind11
There are three advantages for us:
pybind11 is planning to release v2 in early January. v2 brings many improvements for the Eigen wrapper, so it's probably good to start with v2 on our side.
Maybe also rename to set_default_parameters .
When I try to run the GloVe example on my Mac, I get the following error: http://pastebin.com/heW2aF5P
The unzip version is http://pastebin.com/W7ykXz8G .
The SHA1 seems to match what towhee gets (where the file is correctly unzipped):
$ shasum glove.840B.300d.zip
8084fbacc2dee3b1fd1ca4cc534cbfff3519ed0d glove.840B.300d.zip
The GloVe file doesn't belong to us. Can we do something about this?
Remove the following warnings. From a quick search it looked like clang might be too cautious here.
In file included from falconn/swig/falconn_wrap.cc:3591:
falconn/src/include/falconn/lsh_nn_table.h:179:5: warning: suggest braces around initialization of subobject [-Wmissing-braces]
"unknown",
^~~~~~~~~~
falconn/src/include/falconn/lsh_nn_table.h:209:5: warning: suggest braces around initialization of subobject [-Wmissing-braces]
"unknown",
^~~~~~~~~~
2 warnings generated.
It shows:"C3861 '__builtin_prefetch': identifier not found FastLSH d:\setup\falconn-1.2.2\src\include\falconn\core\prefetchers.h 57"
Anyone can help?
Hi,
I know the definition of find_near_neighbors are as follows:
def find_near_neighbors(self, query, threshold)
Find all the points within threshold distance from query.
I would like to know how to calculate near neighbors within distance of threshold from a GIVEN candidates. For example, given smaller candidates is provided as a list of indexes.
Actually, I have a large dataset, I only want to calculate a document's nearest neighbors in a time window [-3 days, +3 days]. And I want to avoid to build a LSH index for each days' data.
Thanks very much.
We have the following methods for getting near(est) neighbors:
find_closest
find_k_nearest_neighbors
find_near_neighbors
I find this a little inconsistent. Should we rename find_closest
to find_nearest_neighbor
?
It would be nice to use Eigen's sparse vector class because it supports linear algebra operations without additional work on our side (as opposed to the std::vector<std::pair<int, float>> we are currently using). But we should first do a small benchmark to compare the two options.
Hello,
I'm experiencing the following error while trying to instal falconn with pip install.
Gist
My Setup: OS X 10.11, virtual environment with python 3.5 and numpy 1.11.1 . Shouldn't make a difference, but I have Eigen installed with homebrew.
Currently, the Python wrapper supports only dense data (via Numpy arrays), would be nice to generalize it to handle sparse data as well.
At the same time, random_benchmark.py works well.
We should see if we can do something similar to annoy's mmap solution.
Basically a more memory-efficient version of FlatHashTable
.
Hi @ludwigschmidt,
I find the find_k_nearest_neighbors() function only returns the index of the search result, with no distance returned. Is it possible to return distance using find_k_nearest_neighbors() function?
Looking forward to your reply.
Thanks.
In particular, try radix sort instead of std::sort
.
Currently, compiling anything that includes FALCONN takes quite a bit of time and RAM (on towhee running make python_package_install
eats more than 2.4 Gb of RAM).
It's a bit painful: for instance, one can't run make python_package_install
on the machine, where the website is hosted, which is unfortunate, since it'd be way more convenient to generate docs on it, instead of pulling them from somewhere.
In some regimes (e.g., the "standard" random point set), the CP probing code is currently the bottleneck (not the candidate distance computation!).
Points to cover:
clang-format --style=Google -i <file-name>
Need to use the code like below instead of "exponential" case analysis in the wrapper.
#include <iostream>
#include <memory>
class Base {
public:
virtual void impl() = 0;
virtual ~Base() {}
};
template<bool u, bool v> class Derived : public Base {
public:
void impl() {
std::cout << u << " " << v << std::endl;
}
};
class Factory {
public:
static std::shared_ptr<Base> construct(bool u, bool v) {
if (u) {
return construct_1<true>(v);
}
else {
return construct_1<false>(v);
}
}
private:
template<bool u> static std::shared_ptr<Base> construct_1(bool v) {
if (v) {
return construct_2<u, true>();
}
else {
return construct_2<u, false>();
}
}
template<bool u, bool v> static std::shared_ptr<Base> construct_2() {
return std::shared_ptr<Base>(new Derived<u, v>);
}
};
int main() {
std::shared_ptr<Base> x = Factory::construct(false, true);
x->impl();
return 0;
}
compute_number_of_hash_functions
, etc.)Is there a way to get only one result when using the find_near_neighbors? I am interesting in solving the decision problem; is there a point in the pointset thats lies within an r radius with the query?
By using the find_near_neighbors it tries to find all the points that satisfy this condition.
Postponed to v1.3:
It is not common practice, and for a good reason.
With that code in place, it's impossible to do a one-pass pip install of a requirements.txt file including both numpy and falconn, because the setup.py of falconn would be run before numpy is installed (and fail!).
They are not really needed in these interface functions.
We might want a field in LSHConstructionParameters instead.
Probably good to figure #5 out first.
I use Anaconda2 64 bit on my Windows 7 system. After I typed "pip install FALCONN", it downloads the FALCONN package and attempt to install but I got the following errors:
KeyType = int; PointSet = falconn::PlainArrayPointSet]'
falconn/swig/python_wrapper.h:337:31: required from here
falconn/src/include/falconn/wrapper/../core/polytope_hash.h:90:9: error: 'posix_memalign' was not declared in this scope
In file included from falconn/external/eigen/Eigen/Core:344:0,
from falconn/external/eigen/Eigen/Dense:1,
from falconn/src/include/falconn/eigen_wrapper.h:11,
from falconn/src/include/falconn/falconn_global.h:8,
from falconn/swig/falconn_wrap.cc:3689:
falconn/external/eigen/Eigen/src/Core/arch/SSE/PacketMath.h: In function 'Packet Eigen::internal::pmadd(const Packet&, const Packet&, const Packet
&) [with Packet = __vector(2) double]':
falconn/external/eigen/Eigen/src/Core/arch/SSE/PacketMath.h:250:134: warning: control reaches end of non-void function [-Wreturn-type]
falconn/external/eigen/Eigen/src/Core/arch/SSE/PacketMath.h: In function 'Packet Eigen::internal::pmadd(const Packet&, const Packet&, const Packet
&) [with Packet = __vector(4) float]':
falconn/external/eigen/Eigen/src/Core/arch/SSE/PacketMath.h:249:134: warning: control reaches end of non-void function [-Wreturn-type]
error: command 'C:\Anaconda2\MinGW\bin\gcc.exe' failed with exit status 1
----------------------------------------
Command "C:\Anaconda2\python.exe -u -c "import setuptools, tokenize;file='c:\users\admini1\appdata\local\temp\pip-build-xmpbnu\FALCONN\se1\ap
tup.py';exec(compile(getattr(tokenize, 'open', open)(file).read().replace('\r\n', '\n'), file, 'exec'))" install --record c:\users\admini
pdata\local\temp\pip-yj92v2-record\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in c:\users\admini~1\app
data\local\temp\pip-build-xmpbnu\FALCONN\
The library of FALCONN only depends on Eigen, and the Eigen has Android version. Is't possible to use it on Android platform?
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.