Memory Cache should use a GPU, when one is available, for inference in order to speed up queries and deriving insights from documents.
What I tried so far
I spent a few days last week exploring the differences between the primordial privateGPT version and the latest one. One of the major differences is that the newer project includes support for GPU inference for llama and gpt4all. The challenge I ran into with the newer version is that moving from the older groovy .ggml model (no longer supported, since privateGPT now uses the .gguf format) to llama doesn't produce the same results when ingesting the same local file store and querying it.
This might be a matter of how RAG is implemented, something about how I set things up on my local machine, or a function of model choice.
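To make the llama side of that comparison concrete, here is a minimal sketch (not the actual privateGPT code) of how the newer stack can offload a .gguf model to the GPU through langchain's LlamaCpp wrapper. The model path, layer count, and context size are placeholders I'd adapt to my own setup.

```python
# Minimal sketch: load a local .gguf model with GPU offload via llama-cpp-python,
# wrapped by langchain. Not privateGPT's exact code; paths/values are placeholders.
from langchain.llms import LlamaCpp  # on newer langchain: langchain_community.llms

llm = LlamaCpp(
    model_path="models/mistral-7b-instruct.Q4_K_M.gguf",  # any local .gguf file
    n_gpu_layers=-1,  # offload all layers; requires a CUDA (or Metal) build of llama-cpp-python
    n_ctx=2048,       # context window
    verbose=True,     # logs whether layers were actually offloaded to the GPU
)

print(llm.invoke("What is the meaning of a life well-lived?"))
```

If the llama-cpp-python wheel was built without CUDA support, this still runs but silently falls back to CPU, which is why the verbose output is worth checking.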
I've lazily tried to see whether this can be resolved through dependency changes, but I haven't had luck getting to a version that runs and supports .ggml and GPU acceleration together. From what I can tell, Nomic introduced GPU support in gpt4all 2.4 (the latest is 2.5+), but it's unclear whether there's a way to get this working cleanly with minimal changes to how my fork of privateGPT uses langchain to import the gpt4all package. It's also unclear to me whether this works on Ubuntu or only through the Vulkan APIs; I need to do some additional investigation.
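Before changing any of the privateGPT/langchain code, it may be worth probing whether the installed gpt4all Python bindings can see a GPU at all. A rough sketch, assuming gpt4all 2.x (the Vulkan-backed releases) and a local .gguf model; the filename here is just an example:

```python
# Probe GPU support in the gpt4all Python bindings directly, outside of langchain.
# Assumes gpt4all >= 2.x; the model filename is a placeholder, not my actual model.
from gpt4all import GPT4All

try:
    model = GPT4All("mistral-7b-openorca.Q4_0.gguf", device="gpu")  # raises if no usable GPU backend
    print(model.generate("What is the meaning of a life well-lived?", max_tokens=128))
except Exception as exc:
    print("GPU-backed gpt4all failed (CPU-only build or unsupported device?):", exc)
```

If this works, the remaining question is how to thread an equivalent device setting through the langchain wrapper my fork uses.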
I did get CUDA installed and verified, using the sample projects provided by Nvidia, that my GPU is properly detected and working.
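For a quicker sanity check from Python (the Nvidia samples remain the authoritative one), something like the following works, assuming nvidia-smi is on the PATH; the torch check only applies if PyTorch happens to be installed:

```python
# Sanity check that the driver and CUDA see the GPU from Python.
import subprocess

# nvidia-smi lists the GPU, driver, and CUDA version if the driver is healthy.
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)

try:
    import torch
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("Device:", torch.cuda.get_device_name(0))
except ImportError:
    print("PyTorch not installed; skipping the torch CUDA check.")
```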
What's next
Testing
I've been using a highly subjective test to evaluate:
Prompt: "What is the meaning of a life well-lived?"
Primordial privateGPT+groovy, augmented with my local files, consistently answers this question with some combination of "technology and community." No other combination of model and project has replicated that consistently.
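To make this check repeatable across model/project combinations, a small script along the lines of the primordial privateGPT query path could run the same prompt against the same ingested store. This is a hedged sketch, not the actual privateGPT code: the persist directory, embedding model, and model paths are assumptions about my local setup.

```python
# Hedged sketch of a repeatable version of the test: query the same ingested
# Chroma store with the same prompt, swapping in whichever LLM is under test.
# Paths and model names are placeholders, not privateGPT's exact configuration.
from langchain.llms import GPT4All
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA

PROMPT = "What is the meaning of a life well-lived?"

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
db = Chroma(persist_directory="db", embedding_function=embeddings)

# Baseline: the groovy model from primordial privateGPT; swap in the GPU-backed
# LlamaCpp or gpt4all instance from the sketches above to compare answers.
llm = GPT4All(model="models/ggml-gpt4all-j-v1.3-groovy.bin", verbose=False)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(search_kwargs={"k": 4}),
)

print(qa({"query": PROMPT})["result"])
```

Running this against each model/project combination would at least make the "technology and community" comparison mechanical, even if the evaluation itself stays subjective.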