My first attempt at using LangChain for retrieval-augmented generation (RAG).
To run, enter at the command prompt: streamlit run ./app.py
To debug, press F5 in VS Code; note, however, that the Streamlit UI runs several threads, which can interfere with the debugger.
The backend LLMs run locally or remotely on either LM Studio or Ollama. A machine with a GPU or NPU helps reduce the response wait to under a minute; a CPU-only machine running a local LLM may take several minutes per response.
The docs folder contains .md diary entries, generated by Bing Chat, of a fictitious Singaporean male. The Python script uses Chroma DB to store and retrieve vectorized sentence embeddings. Before the vectors are stored, the diary entries are split into chunks and then vectorized. You may add PDFs and HTML files to the root or subfolders as you wish; the script detects .md, .html, and .pdf files.
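The chunking step can be sketched in plain Python. This is an illustrative sliding-window splitter with overlap, not the script's actual code (in practice a LangChain text splitter does this before the chunks are embedded into Chroma DB):

```python
def split_into_chunks(text, chunk_size=200, overlap=50):
    # Slide a fixed-size window over the text. The overlap keeps
    # sentences that straddle a boundary intact in at least one
    # chunk, which improves retrieval quality.
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each resulting chunk is then passed to the embedding model and stored in the vector database.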
The LangChain QA retriever converts the user query into a vector embedding and searches Chroma DB for the closest related vectors to obtain the relevant chunks. This is a mathematical calculation that finds the sentences closest in meaning to the query by comparing the spatial distance between their vector representations.
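The distance comparison is typically done with cosine similarity, which Chroma supports as a distance metric. A minimal sketch of the calculation (illustrative only; the actual computation happens inside the vector store):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the magnitudes:
    # 1.0 means the vectors point the same way (most similar),
    # 0.0 means they are orthogonal (unrelated).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

The retriever returns the chunks whose embeddings score highest against the query embedding.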
These retrieved chunks, along with the user query, are merged into the prompt template and sent to the LLM as context. I have tested with the QuantFactory Meta-Llama-3-8B-Instruct Q5_K_M .gguf model. This simple RAG setup is satisfactory for the simple document repository I tested, but quality may suffer on more complicated documents.
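The merge step can be sketched as filling a template with the retrieved context and the query. The template wording below is hypothetical (the one in app.py may differ), but the structure is typical of RAG prompts:

```python
# Hypothetical RAG prompt template; the actual template in
# app.py may use different wording.
PROMPT_TEMPLATE = """Answer the question using only the context below.

Context:
{context}

Question: {question}
"""

def build_prompt(chunks, question):
    # Join the retrieved chunks into one context block, then
    # substitute context and question into the template.
    context = "\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)
```

The resulting string is what the LLM actually sees, so the model answers from the retrieved diary entries rather than from its training data alone.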