Comments (14)
I did have a 42-minute compile time on lp_unexpanded with sm_70 (pre-revert). I would be surprised if it were architecture-specific.
from raft.
Another idea for ivf-pq. Atm, we have lut_dtype and internal_distance_type as orthogonal parameters, which gives 8 different variants (lut_dtype = {fp8_signed, fp8_unsigned, fp16, fp32}, internal_distance_type = {fp16, fp32}). I suggested above to remove the variants where sizeof(lut_dtype) > sizeof(internal_distance_type), which is only one combination out of eight (fp32, fp16). What if, instead, we introduced just a few configurations that make the most sense in terms of the performance/recall compromise? I.e., remove these two search parameters in favor of something like:
enum class compute_optimization {
  /** Represent the internal scores and the lookup table as float32. */
  LEVEL_0,
  /** Represent the internal scores as float32 and the lookup table as float16. */
  LEVEL_1,
  /** Represent the internal scores as float16 and the lookup table as float8. */
  LEVEL_2
};
This should give us another 60% reduction in the number of instances.
from raft.
Definitely, it's a well-known problem of the ivfpq_compute_similarity_kernel! At some point during integration, I downlifted the PqBits to runtime, but that made the compute-scores part of the kernel much heavier on the ALU and resulted in a 10-20% drop in search QPS. Hence we decided to bring it back to the template parameters. It can take values from 4 to 8, resulting in 5x more template instantiations.
What we could potentially improve is the specializations of this kernel (which are done via the ivfpq_compute_similarity helper struct). On the one hand, they speed up the compile times by allowing different instances to be compiled in parallel. On the other hand, the current implementation produces some instances that do not make sense. For example, one could argue that it does not make much sense to have LutT (float8/float16/float32 possible) larger than OutT (float16/float32 possible), hence we can avoid compiling the (float32, float16) combination. The EnableSMemLut boolean in the current selection logic implies PrecompBaseDiff, so we can disable one of the four combinations of the two booleans. With a bit of code tweaking, we can fix IdxT = uint32_t when the fused sort is disabled (Capacity == 0).
from raft.
For example, one could argue that it does not make much sense to have LutT (float8/float16/float32 possible) larger than OutT (float16/float32 possible), hence we can avoid compiling the (float32, float16) combination
This sounds like a good idea. Perhaps we can make it a run-time error? Something like:
if constexpr (sizeof(OutT) < sizeof(LutT)) {
  RAFT_EXPECTS(false, "Size of LutT may not be larger than size of OutT");
} else {
  // launch kernel
}
from raft.
Sure, that's not a problem, we can restrict the instantiation in the kernel selection logic (OutT is the ScoreT there) and throw the error right there, or along with the other runtime checks. This will only reduce the number of instances by 1/6, though.
from raft.
For the pairwise-distance kernels, I think the following would help:
- Consider removing some specializations for data alignments that confer little performance benefit (e.g. VecLen==2 for floats)
- Standardize on a single index type. We now have specializations for int and uint32_t. That seems excessive.
For measurements, I would propose we also save the ninja.log file as a downloadable build artifact (I hope this is possible with GitHub Actions). This way, we can run ninjatracing easily to identify the bottlenecks.
As a general coding practice, maybe the following could help:
- Shrink the "surface area" of the template specializations. Often we have multiple template arguments, each of which is only used in a specific portion of the code. We often use the following pattern:
template <typename T1, typename T2, typename T3>
struct foo {
  T2 do1(T1 x) { /* .. */ }
  T3 do2(T2 y) { /* .. */ }
  T3 call(T1 x) {
    auto y = do1(x);
    auto z = do2(y);
    return z;
  }
};
We could use this pattern instead:
template <typename T1, typename T2>
T2 do1(T1 x) { /* .. */ }

template <typename T2, typename T3>
T3 do2(T2 y) { /* .. */ }

template <typename T1, typename T2, typename T3>
T3 call(T1 x) {
  auto y = do1<T1, T2>(x);
  auto z = do2<T2, T3>(y);
  return z;
}
This will reduce the number of instantiations. For instance, when the types Ti can each be float or double, we will have 8 instantiations of foo::do1 (one for each combination of T1, T2, and T3), but only 4 instantiations of the free do1 (one for each combination of T1 and T2). I hope that this can reduce build times, but I have not measured it.
I am guessing that @Nyrio has some input as well!
from raft.
Hi, I am sharing my progress in this PR: #1228
I have so far reduced compile times by 22% and the number of kernels by more than 50% with two 2-line changes in the pairwise-distance code.
from raft.
Another measurement technique we can use is to add the --time nvcc_compile.log flag to the nvcc command line. This will write the time taken by all intermediate compilation stages of nvcc to the file in CSV format. We can make this file a downloadable artifact for further analysis, or create a graph/plot/table immediately in CI.
--time <file name> (-time)
Generate a comma separated value table with the time taken by each compilation
phase, and append it at the end of the file given as the option argument.
If the file is empty, the column headings are generated in the first row
of the table. If the file name is '-', the timing data is generated in stdout.
from raft.
For IVF-PQ compute similarity kernels, it does not add much value to have both int64 and uint64 specializations. A quick fix would be to just remove the int64 version from all instantiations (and tests and benchmarks).
[update] Unfortunately, we have a public interface for both the uint64 and int64 specializations:
- cuML uses the int64 specialization: https://github.com/rapidsai/cuml/blob/85b33dfd9d77d3ee38785bd3d5e1720c89fb5ebe/python/cuml/neighbors/nearest_neighbors.pyx#L757
- our Python wrappers for IVF-PQ use uint64
@achirkin mentioned that it might be possible to keep only uint32 in this internal kernel (because we are working with smaller chunks of the dataset). He is investigating this option.
from raft.
Posting a ninjatracing log here to provide a breakdown of the compile times on my workstation. Just to focus on the things that other projects depend directly upon, this trace only includes source files compiled into the shared libs and not tests or benchmarks.
One good thing to note is that most of the source files which are bottlenecks are in a specializations directory. Any source file compiling code for raft::runtime should be calling a specialization directly. neighbors/ivfpq_search.cu currently violates this, so we should look into that for sure (as outlined in #1194), because it 1) needs to be split into separate compilation units and 2) needs to use consistent template types so it can use the existing specializations.
Other offenders (not necessarily in order) in addition to ivfpq_search specializations:
- ball_cover.cu
- refine_*.cu
- knn.cu (hoping this might be largely fixed once @benfred's new brute-force changes are merged)
- lp_unexpanded_*.cu
- cosine.cu
- jensen_shannon.cu
from raft.
Here's the timeline for the end-to-end build without #1232, and with #1232 (timeline charts attached in the original comment).
Unfortunately, I need to revert a couple of changes for the release because the build time in CI has caused some problems for users. For 23.04, we should focus on getting those changes back while also introducing more changes like #1230 #1228 and #1220 to lower the build time further. It would be great if we could get the end-to-end build time under 20 minutes again on a single architecture. Now that we're building for 6 architectures w/ CUDA 11.8, we should strive to keep the initial cpp-build times in CI to just over an hour if possible.
from raft.
This is some additional analysis on the ninja log traces that @cjnolet shared. It shows the compilation units that have more than 1 minute compile time and compares between before and after the revert in #1232. The primary impact of the revert seems to be on lp_unexpanded_float_float_float_uint32.cu and canberra.cu.
from raft.
@ahendriksen that analysis is great! Strangely, I wasn't seeing that issue at all when compiling exclusively for sm_70. It appears to be something that starts w/ sm_80 and above? But out of the 6 architectures that compile in CI, I guess at least 4 of them are sm_80 and above, so that might explain why that file alone is taking 4+ hours.
from raft.
I opened #1235 to remove all the uint32_t specializations.
from raft.