Comments (5)
Hi Robert! I haven't tested PaCMAP on cases of that scale. The largest case I tested is around 1.8B, which takes ~42min to finish on a 48-core Intel Xeon Gold 6226 processor, and PaCMAP's running time should scale linearly with respect to the number of rows. Regarding memory, this case finished successfully using less than 64GB of RAM, but I don't have the exact RAM usage number.
Regarding multi-node support, at this moment we don't have a plan for it.
from pacmap.
numpy.core._exceptions.MemoryError: Unable to allocate 286. GiB for an array with shape (100000000, 768) and data type float32
@hyhuang00 would I be able to iterate over smaller subsets (.fit(batch1), .fit(batch2) etc.) without completely losing the information from the first batches? Or does .fit start completely from zero when I do an additional fit? Or what would be the best way to "fit" more than one batch and then do the transform for every batch afterwards?
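For reference, the size in the error message follows directly from the array shape and dtype. The sketch below reproduces that arithmetic; the helper name `array_gib` is made up for illustration, and the 4-byte item size is simply the width of a float32.

```python
import math

def array_gib(shape, itemsize):
    """Return the size in GiB of a dense array with the given shape and bytes per element."""
    n_bytes = math.prod(shape) * itemsize
    return n_bytes / 2**30

# Shape taken from the MemoryError above; float32 uses 4 bytes per element.
full_gib = array_gib((100_000_000, 768), itemsize=4)
print(f"{full_gib:.1f} GiB")
```

The same helper can be used to check whether a smaller batch (fewer rows) would fit in the available RAM before trying to allocate it.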
At this moment, fit() forgets the previous data and starts completely from zero. One possibility is to fit() on a small batch of the dataset and then transform() the other parts. This can handle many situations if the initial batch is large and representative enough, but it will fail if the initial batch does not capture some of the information in the rest of the data.
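A minimal sketch of that fit-on-a-sample, transform-the-rest approach is shown below. The helper `embed_in_batches` and its parameters are hypothetical, not part of PaCMAP; it only assumes the reducer exposes `fit(X)` and `transform(X)`. Some PaCMAP versions may expect extra arguments to `transform()` (for example, the data used for fitting), which this sketch does not handle, so check the signature of your installed version.

```python
import numpy as np

def embed_in_batches(reducer, X, sample_size, batch_size, seed=0):
    """Fit `reducer` on a random sample of rows, then transform all rows batch by batch.

    `reducer` is any object with fit(X) and transform(X) methods (an assumption).
    """
    rng = np.random.default_rng(seed)
    n_rows = X.shape[0]
    # Draw the sample from the whole dataset so it is as representative as possible.
    sample_idx = np.sort(rng.choice(n_rows, size=min(sample_size, n_rows), replace=False))
    reducer.fit(X[sample_idx])

    parts = []
    for start in range(0, n_rows, batch_size):
        # Only one batch of input rows is passed to transform() at a time.
        parts.append(reducer.transform(X[start:start + batch_size]))
    return np.vstack(parts)
```

Sampling at random across all rows, rather than taking the first rows, is a design choice aimed at the failure case described above: a first batch that misses part of the data distribution. `X` can be a `numpy.memmap` so that the full array never has to sit in RAM.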
I just posted a question about speeding up processing of a large dataset. I am also trying to apply pacmap to embeddings with dimension 768.
Does your timing correspond to mine? About 35min/1M rows?
I tested the 1M case on a dataset with 100 dimensions. For datasets with more than 100 dimensions, we first apply PCA to reduce them to 100 dimensions, which adds an extra offset to the running time. 35min/1M rows would be a good estimate for everything that happens afterwards (pair construction, embedding optimization, etc.).
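As a rough back-of-the-envelope sketch, the 35min/1M rows figure can be extrapolated under the linear-scaling assumption mentioned earlier in the thread. The helper `estimate_minutes` is hypothetical, and the result ignores the PCA offset and any hardware differences, so treat it as an order-of-magnitude guess rather than a measurement.

```python
def estimate_minutes(n_rows, minutes_per_million=35.0):
    """Linearly extrapolate running time from a per-million-rows rate (an assumption)."""
    return n_rows / 1_000_000 * minutes_per_million

# 100M rows, the row count from the MemoryError earlier in the thread.
minutes = estimate_minutes(100_000_000)
hours = minutes / 60
print(f"{minutes:.0f} min, about {hours:.1f} h")
```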