raghavan / instantqna Goto Github PK

24.0 2.0 0.0 7 KB

Build instant question and answer engine for PDF books using OpenAI Models

License: MIT License

Python 100.00%

instantqna's Introduction

InstantQnA

Introducing the Instant QnA builder - a powerful tool that allows you to quickly and easily create searchable QnA systems from PDF files. Using state-of-the-art OpenAI technology, this tool generates search embeddings for your documents, making it easy to find the information you need.

How it works ?

Install the project's dependencies:

Windows:

pip install -r requirements.txt

Unix:

python3 -m pip install -r requirements.txt

Update constants.py, with your OpenAI API Token
```
token="<YOUR-OPENAI-API-TOKEN>"
```
Place PDFs that you want to search inside /sources directory
Run the program

Windows:
```
python main.py
```
Unix:
```
python3 main.py
```
An estimated cost to embed all of the files will be prompted for y/n. Choose y to proceed further. By default this engine use text-embedding-ada-002 which is less expensive and also perfomant. You can update the code to embed using other models like davinci, etc...
Once all of the files are full processed and embedded, then the program will show a prompt for you to enter your search query, if there are matching results it will return top 3 results with their score and source file name.

Usage

If you have PDF files from which you want to build a question and answer engine, this tool should be useful for you.

Upload PDF file

To begin, select the PDF file that you want to create a QnA system for and upload it to the tool.

`read_source.py`

This python file reads all of the PDFs file from /sources and then write all of its text content to /ai_generated/dumps.

`get_file_data.py`

Go through all files in sources and collect which file that hasn't been embedded yet, or the embedding has expired.

Generate search embeddings

Once the file is uploaded, generate search embeddings for the contents of the PDF. This process may take a few minutes, depending on the size of the file.

`create_dataset.py`

Parses through all text content within a PDF, grouping them into coherent paragraphs no longer than 1000 tokens. This dataset is then saved in a CSV format, providing a structured and readable format for an AI model to process.

`embed.py`

This file creates the text embedding using OpenAI Ada model (you can customize to any model) and also provides the search/query functions

Execute search queries

You can now execute search queries to find the information you need. Enter your query in the search box and the tool will return any matching results from the PDF.

`main.py`

The main function where you run the project

instantqna's People

Contributors

Stargazers

Watchers

instantqna's Issues

Geeting Invalid API token

I Was trying to run this code but getting below error:

➜ InstantQnA git:(main) ✗ python3 main.py

                                    Instant Q&A Builder

ai_generated/embeds/bill (2).pdf.json EMBED NOT FOUND

Reading pdf text content: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 308.78it/s]

Creating dataset: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 89.60it/s]

Calculating total token: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 294.01it/s]

0 tokens in total, (approx. $0.0)
Would you like to embed? (y/n)y
Embedding files: 0%| | 0/1 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Invalid API token
Embedding files: 100%|█

raghavan / instantqna Goto Github PK

instantqna's Introduction

InstantQnA

How it works ?

Usage

Upload PDF file

read_source.py

get_file_data.py

Generate search embeddings

create_dataset.py

embed.py