
llmsherpa's Introduction

LLM Sherpa

LLM Sherpa provides strategic APIs to accelerate large language model (LLM) use cases.

What's New

Important

The llmsherpa backend service is now fully open sourced under the Apache 2.0 License. See https://github.com/nlmatics/nlm-ingestor

  • You can now run your own servers using a docker image!
  • Support for different file formats: DOCX, PPTX, HTML, TXT, XML
  • OCR Support is built in
  • Blocks now have coordinates - use the bbox property of blocks such as sections
  • A new indent parser to better align all headings in a document to their corresponding level
  • The free server and paid server are no longer updated with the latest code; users are requested to spawn their own servers using the instructions in nlm-ingestor
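Spawning your own server from the open-sourced Docker image could look like the following (a sketch; the image name and port are taken from community reports elsewhere on this page, so check the nlm-ingestor README for the authoritative instructions):

```shell
# Pull and start the open-sourced nlm-ingestor parser server.
docker pull ghcr.io/nlmatics/nlm-ingestor:latest
docker run -p 5010:5010 ghcr.io/nlmatics/nlm-ingestor:latest

# Then point LayoutPDFReader at your own instance instead of the free server:
# llmsherpa_api_url = "http://localhost:5010/api/parseDocument?renderFormat=all"
```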

LayoutPDFReader

Most PDF to text parsers do not provide layout information. Often, even sentences are split with arbitrary CR/LFs, making it very difficult to find paragraph boundaries. This poses various challenges in chunking and in adding long-running contextual information, such as section headers, to the passages while indexing/vectorizing PDFs for LLM applications such as retrieval augmented generation (RAG).

LayoutPDFReader solves this problem by parsing PDFs along with hierarchical layout information such as:

  1. Sections and subsections along with their levels.
  2. Paragraphs, with split lines combined.
  3. Links between sections and paragraphs.
  4. Tables along with the section the tables are found in.
  5. Lists and nested lists.
  6. Joining of content spread across pages.
  7. Removal of repeating headers and footers.
  8. Watermark removal.

With LayoutPDFReader, developers can find optimal chunks of text to vectorize, and a solution for limited context window sizes of LLMs.
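To make the header/footer removal above concrete, here is a toy sketch (not the library's actual algorithm) of how lines that repeat across pages can be detected and stripped:

```python
from collections import Counter

def strip_repeating_lines(pages, threshold=0.8):
    """Drop lines that repeat on most pages (likely headers/footers).

    pages: list of per-page text strings; threshold: fraction of pages
    a line must appear on to be considered boilerplate.
    """
    # Count on how many pages each distinct line occurs.
    counts = Counter(line for page in pages
                     for line in set(page.splitlines()))
    cutoff = threshold * len(pages)
    boilerplate = {line for line, n in counts.items() if n >= cutoff}
    return ["\n".join(line for line in page.splitlines()
                      if line not in boilerplate)
            for page in pages]

pages = ["ACME Corp\nIntro text\nPage 1",
         "ACME Corp\nMore text\nPage 2",
         "ACME Corp\nEnd text\nPage 3"]
print(strip_repeating_lines(pages))
```

The page numbers survive because they differ on every page; only the identical "ACME Corp" header is removed.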

You can experiment with the library directly in Google Colab here

Here's a writeup explaining the problem and our approach.

Here's a LlamaIndex blog explaining the need for smart chunking.

API Reference: https://llmsherpa.readthedocs.io/

  • How to use with Google Gemini Pro
  • How to use with Cohere Embed3

Important Notes

  • The LayoutPDFReader is tested on a wide variety of PDFs. That being said, it is still challenging to get every PDF parsed correctly.
  • OCR is not supported on the free server; only PDFs with a text layer are supported. (The self-hosted nlm-ingestor server has OCR support built in.)

Note

LLMSherpa uses a free and open API server. The server does not store your PDFs except for temporary storage during parsing. This server will be decommissioned soon. Self-host your own private server using the instructions at https://github.com/nlmatics/nlm-ingestor

Important

The private instance available on Microsoft Azure Marketplace will be decommissioned soon. Please move to a self-hosted instance using the instructions at https://github.com/nlmatics/nlm-ingestor.

Installation

pip install llmsherpa

Read a PDF file

The first step in using the LayoutPDFReader is to provide a URL or file path to it and get back a document object.

from llmsherpa.readers import LayoutPDFReader

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_url = "https://arxiv.org/pdf/1910.13461.pdf" # also allowed is a file path e.g. /home/downloads/xyz.pdf
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

Install LlamaIndex

In the following examples, we will use LlamaIndex for simplicity. Install the library if you haven't already.

pip install llama-index

Setup OpenAI

import openai
openai.api_key = "<Insert API Key>"

Vector search and Retrieval Augmented Generation with Smart Chunking

LayoutPDFReader does smart chunking, keeping text that the document structure marks as related together in the same chunk:

  • All list items are kept together, including the paragraph that precedes the list.
  • Items in a table are chunked together.
  • Contextual information from section headers and nested section headers is included.
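Conceptually, to_context_text() prefixes each chunk with its section-header trail; a toy illustration (hypothetical helper, not the library code):

```python
def with_context(header_path, chunk_text):
    """Toy sketch: prepend the section-header trail to a chunk so the
    vectorized text carries its surrounding document context."""
    return " > ".join(header_path) + "\n" + chunk_text

ctx = with_context(["3 Fine-tuning BART", "3.4 Machine Translation"],
                   "We also explore using BART to improve MT decoders.")
print(ctx)
```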

The following code creates a LlamaIndex query engine from LayoutPDFReader document chunks:

from llama_index.core import Document
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex([])
for chunk in doc.chunks():
    index.insert(Document(text=chunk.to_context_text(), extra_info={}))
query_engine = index.as_query_engine()

Let's run one query:

response = query_engine.query("list all the tasks that work with bart")
print(response)

We get the following response:

BART works well for text generation, comprehension tasks, abstractive dialogue, question answering, and summarization tasks.

Let's try another query that needs answer from a table:

response = query_engine.query("what is the bart performance score on squad")
print(response)

Here's the response we get:

The BART performance score on SQuAD is 88.8 for EM and 94.6 for F1.

Summarize a Section using prompts

LayoutPDFReader offers powerful ways to pick sections and subsections from a large document and use LLMs to extract insights from a section.

The following code looks for the Fine-tuning section of the document:

from IPython.core.display import display, HTML
selected_section = None
# find a section in the document by title
for section in doc.sections():
    if section.title == '3 Fine-tuning BART':
        selected_section = section
        break
# use include_children=True and recurse=True to fully expand the section. 
# include_children only returns at one sublevel of children whereas recurse goes through all the descendants
HTML(selected_section.to_html(include_children=True, recurse=True))

Running the above code yields the following HTML output:

3 Fine-tuning BART

The representations produced by BART can be used in several ways for downstream applications.

3.1 Sequence Classification Tasks

For sequence classification tasks, the same input is fed into the encoder and decoder, and the final hidden state of the final decoder token is fed into new multi-class linear classifier. This approach is related to the CLS token in BERT; however we add the additional token to the end so that representation for the token in the decoder can attend to decoder states from the complete input (Figure 3a).

3.2 Token Classification Tasks

For token classification tasks, such as answer endpoint classification for SQuAD, we feed the complete document into the encoder and decoder, and use the top hidden state of the decoder as a representation for each word. This representation is used to classify the token.

3.3 Sequence Generation Tasks

Because BART has an autoregressive decoder, it can be directly fine tuned for sequence generation tasks such as abstractive question answering and summarization. In both of these tasks, information is copied from the input but manipulated, which is closely related to the denoising pre-training objective. Here, the encoder input is the input sequence, and the decoder generates outputs autoregressively.

3.4 Machine Translation

We also explore using BART to improve machine translation decoders for translating into English. Previous work Edunov et al. (2019) has shown that models can be improved by incorporating pre-trained encoders, but gains from using pre-trained language models in decoders have been limited. We show that it is possible to use the entire BART model (both encoder and decoder) as a single pretrained decoder for machine translation, by adding a new set of encoder parameters that are learned from bitext (see Figure 3b).

More precisely, we replace BART’s encoder embedding layer with a new randomly initialized encoder. The model is trained end-to-end, which trains the new encoder to map foreign words into an input that BART can de-noise to English. The new encoder can use a separate vocabulary from the original BART model.

We train the source encoder in two steps, in both cases backpropagating the cross-entropy loss from the output of the BART model. In the first step, we freeze most of BART parameters and only update the randomly initialized source encoder, the BART positional embeddings, and the self-attention input projection matrix of BART’s encoder first layer. In the second step, we train all model parameters for a small number of iterations.

Now, let's create a custom summary of this text using a prompt:

from llama_index.llms import OpenAI
context = selected_section.to_html(include_children=True, recurse=True)
question = "list all the tasks discussed and one line about each task"
resp = OpenAI().complete(f"read this text and answer question: {question}:\n{context}")
print(resp.text)

The above code results in following output:

Tasks discussed in the text:

1. Sequence Classification Tasks: The same input is fed into the encoder and decoder, and the final hidden state of the final decoder token is used for multi-class linear classification.
2. Token Classification Tasks: The complete document is fed into the encoder and decoder, and the top hidden state of the decoder is used as a representation for each word for token classification.
3. Sequence Generation Tasks: BART can be fine-tuned for tasks like abstractive question answering and summarization, where the encoder input is the input sequence and the decoder generates outputs autoregressively.
4. Machine Translation: BART can be used to improve machine translation decoders by incorporating pre-trained encoders and using the entire BART model as a single pretrained decoder. The new encoder parameters are learned from bitext.

Analyze a Table using prompts

With LayoutPDFReader, you can iterate through all the tables in a document and use the power of LLMs to analyze a table. Let's look at the 6th table in this document. If you are using a notebook, you can display the table as follows:

from IPython.core.display import display, HTML
HTML(doc.tables()[5].to_html())

The output table structure looks like this:

| Model | SQuAD 1.1 EM/F1 | SQuAD 2.0 EM/F1 | MNLI m/mm | SST Acc | QQP Acc | QNLI Acc | STS-B Acc | RTE Acc | MRPC Acc | CoLA Mcc |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT | 84.1/90.9 | 79.0/81.8 | 86.6/- | 93.2 | 91.3 | 92.3 | 90.0 | 70.4 | 88.0 | 60.6 |
| UniLM | -/- | 80.5/83.4 | 87.0/85.9 | 94.5 | - | 92.7 | - | 70.9 | - | 61.1 |
| XLNet | 89.0/94.5 | 86.1/88.8 | 89.8/- | 95.6 | 91.8 | 93.9 | 91.8 | 83.8 | 89.2 | 63.6 |
| RoBERTa | 88.9/94.6 | 86.5/89.4 | 90.2/90.2 | 96.4 | 92.2 | 94.7 | 92.4 | 86.6 | 90.9 | 68.0 |
| BART | 88.8/94.6 | 86.1/89.2 | 89.9/90.1 | 96.6 | 92.5 | 94.9 | 91.2 | 87.0 | 90.4 | 62.8 |

Now let's ask a question to analyze this table:

from llama_index.llms import OpenAI
context = doc.tables()[5].to_html()
resp = OpenAI().complete(f"read this table and answer question: which model has the best performance on squad 2.0:\n{context}")
print(resp.text)

The above question will result in the following output:

The model with the best performance on SQuAD 2.0 is RoBERTa, with an EM/F1 score of 86.5/89.4.

That's it! LayoutPDFReader also supports tables with nested headers and header rows.

Here's an example with nested headers:

from IPython.core.display import display, HTML
HTML(doc.tables()[6].to_html())
| Model | CNN/DailyMail R1 | CNN/DailyMail R2 | CNN/DailyMail RL | XSum R1 | XSum R2 | XSum RL |
|---|---|---|---|---|---|---|
| Lead-3 | 40.42 | 17.62 | 36.67 | 16.30 | 1.60 | 11.95 |
| PTGEN (See et al., 2017) | 36.44 | 15.66 | 33.42 | 29.70 | 9.21 | 23.24 |
| PTGEN+COV (See et al., 2017) | 39.53 | 17.28 | 36.38 | 28.10 | 8.02 | 21.72 |
| UniLM | 43.33 | 20.21 | 40.51 | - | - | - |
| BERTSUMABS (Liu & Lapata, 2019) | 41.72 | 19.39 | 38.76 | 38.76 | 16.33 | 31.15 |
| BERTSUMEXTABS (Liu & Lapata, 2019) | 42.13 | 19.60 | 39.18 | 38.81 | 16.50 | 31.27 |
| BART | 44.16 | 21.28 | 40.90 | 45.14 | 22.27 | 37.25 |

Now let's ask an interesting question:

from llama_index.llms import OpenAI
context = doc.tables()[6].to_html()
question = "tell me about R1 of bart for different datasets"
resp = OpenAI().complete(f"read this table and answer question: {question}:\n{context}")
print(resp.text)

And we get the following answer:

R1 of BART for different datasets:

- For the CNN/DailyMail dataset, the R1 score of BART is 44.16.
- For the XSum dataset, the R1 score of BART is 45.14.

Get the Raw JSON

To get the complete JSON returned by the llmsherpa service and process it differently, simply use the json attribute:

doc.json
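As a sketch of what you might do with the raw JSON (the block fields shown here, tag and sentences, are assumptions modeled on nlm-ingestor output; inspect doc.json from your own server before relying on them):

```python
# Hypothetical block list mimicking the shape of doc.json; the exact
# field names may differ, so check the output of your own server.
blocks = [
    {"tag": "header", "level": 0, "sentences": ["3 Fine-tuning BART"]},
    {"tag": "para", "level": 1,
     "sentences": ["The representations produced by BART can be reused."]},
]

# Walk the raw blocks and render a flat "tag: text" outline.
rendered = [f"{b['tag']}: {' '.join(b.get('sentences', []))}" for b in blocks]
print(rendered)
```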

llmsherpa's People

Contributors

aaryantr, ansukla, ioanadragan, jsv4, kiran-nlmatics, lebigot, moshewe, sonalshad


llmsherpa's Issues

Can I get coordinates for each chunk?

First of all, thank you for implementing such a wonderful library.

I tried using it and got some good results.

I thought it would be better if there was coordinate information for each chunk, but do you have any plans to implement it in the future?

I think it would be technically difficult to obtain data that includes coordinate information...

How to add custom parser API URL

Looking at the documentation, to parse a pdf one has to pass a parser url which is basically an api that does all the magic (chunks, sections, paragraphs etc.).
I was wondering where this api is hosted and if ever we can self host this api to use in a local environment.

In the doc here it is written the following : Use customer url for your private instance here

Langchain integration

Hi, thanks for this amazing lib. Any plans for Langchain or OpenAI Assistants API integration?

Source Citations

Trying to sort out how to best set up llmsherpa to cooperate with a llama-index Query Engine system where I need to retrieve source citations. I was having a hard time when using it as a loader, wondering if there's a way to implement it as a parser/sentence splitter?

Would appreciate any help you can offer - the results after swapping out PDFMiner were undeniable - instant 80% boost in accuracy and understanding.

error loading a document

When I load a document of about 800 pages, the connection times out. Is this common for huge files?

How can I check about 'block_class'?

First, I want to say thanks for your great work. I was surprised that this API can recognize Korean. Very impressive.

Everything is perfect except a specific form of table. I want to remove this type of table from the LayoutPDFReader result, but I cannot isolate the tables that LayoutPDFReader does not recognize properly.

However, I noticed that I can detect them with 'block_class', so I want to know what 'block_class' is and how to check it.

Thanks, and sorry for my poor English.

LayoutPDFReader._parse_pdf returns error when pdf contains empty pages

I tried processing a PDF file using the LayoutPDFReader.read_pdf() method, but got a KeyError for response_json['return_dict']['result']['blocks'], since the response did not contain results because there was an error (on a side note: it would be nice to have a specific error in this case instead of a KeyError, clearly stating that the file could not be processed and why).

I split my pdf in pages and processed each page separately to understand what the issue was. Turns out that the error existed every time an empty page was being processed. I am not sure whether this is the case for empty pages of all types of pdfs or just for some pdf types (there are small differences between text pdfs depending on how they were created). It only occurred on one of the pdfs I was processing, but it was also the only pdf with empty pages...

Better: do not fail processing of a whole document if it has one empty page, but simply skip that page.
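One defensive pattern (a sketch, not part of the library) is to validate the parser response before indexing into it, turning the opaque KeyError into an actionable message:

```python
import json

def extract_blocks(response_text):
    """Return the blocks list from a parser response, raising a clear
    error instead of a bare KeyError when the server reports a failure."""
    data = json.loads(response_text)
    try:
        return data["return_dict"]["result"]["blocks"]
    except KeyError as exc:
        # The shape of error responses is a guess; adapt this to what
        # your server actually returns.
        detail = data.get("return_dict", data)
        raise ValueError(f"parser did not return blocks: {detail}") from exc

ok = '{"return_dict": {"result": {"blocks": [{"tag": "para"}]}}}'
print(extract_blocks(ok))
```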

KeyError: result

Hi, I am facing the same error, I have 5 PDF files. The code works for 4 except one, I tried running it a few times like @jalkestrup but it still throws the error:
KeyError: 'result'
I would really appreciate any support from authors/community on this!

P.S. Since the issue wasn't resolved hence reopening the issue :)

Azure MarketPlace cannot provide stable service

We have recently deployed the LLMSherpa service through Azure Marketplace and are experiencing an intermittent issue. Specifically, the API suddenly stops responding to any requests, failing to return results. This issue persists even after restarting both the Azure machine and the service itself. However, we have noticed that the service spontaneously resumes normal operation the following day. Additionally, we are unable to log into the deployed machine to check if the service is functioning normally, and we cannot access any logs.

Has anyone else encountered similar issues with services deployed via Azure Marketplace? If so, could you please share any insights or solutions to this problem? Additionally, we are interested in knowing if there are alternative, more stable API services available that we could consider.

Thank you for your assistance.

PDF size limit - apologies if I caused problems

Apologies if I caused problems just recently. I uploaded a huge, complicated PDF file (over 200 MB) to test the module. Is there a limit on file size we need to observe to avoid causing issues?

read_pdf fails on specific pdf locally, not through hosted api

PDF in question:
JTR.pdf

This api call works great
llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_url = "JTR.pdf"
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

This local call fails
llmsherpa_api_url = "http://localhost:5010/api/parseDocument?renderFormat=all"
pdf_url = "JTR.pdf"
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

The local version is running from the latest docker build. Other pdfs work fine. Is there a way to get a better error message? Currently receiving: KeyError: 'return_dict'

I noticed there are other issues open around this error but did not find any matching this case where it works on one and not the other.

I appreciate your time and any insight. Thanks!

Nodes and llama index

When using the llama index example:
from llama_index.readers.schema.base import Document
from llama_index import VectorStoreIndex

index = VectorStoreIndex([])
for chunk in doc.chunks():
    index.insert(Document(text=chunk.to_context_text(), extra_info={}))

Are the chunks converted to nodes? I'm trying to use Pinecone, but it requires documents or nodes for ingestion, and this happens after the index is created.

JSONDecodeError

self.pdf_reader = readers.LayoutPDFReader(
    self.api_url,
)
document = self.pdf_reader.read_pdf(final_location)

returns

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 21, in read_document
File "/Users/joncheng/opt/miniconda3/envs/owlbear/lib/python3.10/site-packages/llmsherpa/readers/file_reader.py", line 72, in read_pdf
response_json = json.loads(parser_response.data.decode("utf-8"))
File "/Users/joncheng/opt/miniconda3/envs/owlbear/lib/python3.10/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/Users/joncheng/opt/miniconda3/envs/owlbear/lib/python3.10/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/Users/joncheng/opt/miniconda3/envs/owlbear/lib/python3.10/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Looks like there's an issue with handling jsondecoding errors.

I just provided a local file path.

Skips first few lines in PDF.

I'm running the server in docker:

image: ghcr.io/nlmatics/nlm-ingestor:latest

I've only tested with one 300-page PDF, and it seems to skip the first couple of lines of the PDF. It doesn't seem to be a big issue, but it makes me wonder if anything else is being skipped. This is the same whether I convert to text, use sections, or convert to HTML.

What might be the cause?

Timeout Error

When I try to load a PDF that is 541 pages long (~9.5 MB), I get the following error message:

ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

I assume it's due to the large file size? I don't have the same issue loading smaller files.

InternalServerError: Error code: 503

from llama_index.readers.schema.base import Document
from llama_index import VectorStoreIndex

index = VectorStoreIndex([])
for chunk in doc.chunks():
    index.insert(Document(text=chunk.to_context_text(), extra_info={}))
query_engine = index.as_query_engine()

InternalServerError Traceback (most recent call last)
Input In [17], in <cell line: 5>()
4 index = VectorStoreIndex([])
5 for chunk in doc.chunks():
----> 6 index.insert(Document(text=chunk.to_context_text(), extra_info={'embed_model':'text-embedding-V2'}))
7 query_engine = index.as_query_engine()

InternalServerError: Error code: 503 - {'error': {'message': 'There are no available channels for model text embedding ada-002 under the current group VIP (request id: 20240108173434543639222FMD9UnZh)', 'type': 'new_api_error'}}

Question: How to define parameter change models?

Getting a json parse exception when trying to use the `content=` parameter of LayoutPDFReader.read_pdf() with a None value.

Steps taken:

from llmsherpa.readers import LayoutPDFReader
from pathlib import Path

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
parser= LayoutPDFReader(llmsherpa_api_url)
path = Path('yp') / 'tests' / 'content' / 'Ambrx EX-2.1.pdf'
with open(path, 'rb') as f:
    content = f.read()
parser.read_pdf(None, content)

resulting stack trace:

Traceback (most recent call last):
  File "/home/mboyd/.pycharm_helpers/pydev/pydevconsole.py", line 364, in runcode
    coro = func()
           ^^^^^^
  File "<input>", line 1, in <module>
  File "/home/mboyd/.virtualenvs/yp-demo/lib/python3.12/site-packages/llmsherpa/readers/file_reader.py", line 72, in read_pdf
    response_json = json.loads(parser_response.data.decode("utf-8"))
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mboyd/.pyenv/versions/3.12.1/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mboyd/.pyenv/versions/3.12.1/lib/python3.12/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mboyd/.pyenv/versions/3.12.1/lib/python3.12/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

However, parser.read_pdf('', contents=content) DOES successfully parse, since an empty string evaluates to false and cleanly converts to valid JSON in _parse_pdf(), unlike None. None, however, would be the normal Pythonic way of specifying no value.
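Until the library normalizes this itself, a thin wrapper (hypothetical helper; the contents= keyword follows the usage above) can coerce None to the empty string before calling read_pdf:

```python
def read_pdf_safe(reader, path_or_url, contents=None):
    """Coerce None to '' before delegating, so passing raw bytes with
    no path works the same as passing an empty string."""
    return reader.read_pdf(path_or_url or "", contents=contents)
```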

ImportError: cannot import name 'AsyncAzureOpenAI' from 'openai'

This error was generated using your code example.

Traceback (most recent call last):
File "C:\Users\x\wc-chat-pdf-py-willis\layout-pdf-reader.py", line 2, in <module>
from llama_index.readers.schema.base import Document
File "C:\Users\x\AppData\Local\Programs\Python\Python311\Lib\site-packages\llama_index\__init__.py", line 17, in <module>
from llama_index.embeddings.langchain import LangchainEmbedding
File "C:\Users\x\AppData\Local\Programs\Python\Python311\Lib\site-packages\llama_index\embeddings\__init__.py", line 7, in <module>
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
File "C:\Users\x\AppData\Local\Programs\Python\Python311\Lib\site-packages\llama_index\embeddings\azure_openai.py", line 3, in <module>
from openai import AsyncAzureOpenAI, AzureOpenAI
ImportError: cannot import name 'AsyncAzureOpenAI' from 'openai' (C:\Users\x\AppData\Local\Programs\Python\Python311\Lib\site-packages\openai\__init__.py)

timeout error?

I am getting this error while reading a local PDF (quite a big one, though). FWIW, the Unstructured API managed it well.


File /opt/homebrew/lib/python3.11/site-packages/llmsherpa/readers/file_reader.py:41, in LayoutPDFReader.read_pdf(self, path_or_url)
     39 parser_response = self._parse_pdf(pdf_file)
     40 response_json = json.loads(parser_response.data.decode("utf-8"))
---> 41 blocks = response_json['return_dict']['result']['blocks']
     42 return Document(blocks)

KeyError: 'result'

Connection Error

I am using the latest LLMSherpa to chunk a PDF but I always get this SSLCertVerificationError. I am using Python 3.12 and simple code:

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

It looks like this is a known issue that could be resolved by disabling the SSL check, but I could not find any way to handle it, as the connections are made by LayoutPDFReader with no handle to disable SSL verification. Please guide me.

Error Details:
SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000)

MaxRetryError: HTTPSConnectionPool(host='readers.llmsherpa.com', port=443): Max retries exceeded with url: /api/document/developer/parseDocument?renderFormat=all (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000)')))

APIConnectionError: Connection error.


LocalProtocolError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/httpcore/_exceptions.py in map_exceptions(map)
9 try:
---> 10 yield
11 except Exception as exc: # noqa: PIE786

106 frames
LocalProtocolError: Illegal header value b'Bearer '

(the same LocalProtocolError is chained several more times)

APIConnectionError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/openai/_base_client.py in _request(self, cast_to, options, remaining_retries, stream, stream_cls)
903 )
904
--> 905 raise APIConnectionError(request=request) from err
906
907 log.debug(

APIConnectionError: Connection error.

LayoutPDFReader_Demo.ipynb test Error confirmation request and fine-tuning section GPT multilingual translation request code

I am reporting a malfunction while testing based on LayoutPDFReader_Demo.ipynb.

1. PDF download from an external URL fails

pdf_url = "https://arxiv.org/pdf/1910.13461.pdf"

UnboundLocalError Traceback (most recent call last)
in <cell line: 6>()
      4 pdf_url = "https://arxiv.org/pdf/1910.13461.pdf" # also allowed is a file path e.g. /home/downloads/xyz.pdf
      5 pdf_reader = LayoutPDFReader(llmsherpa_api_url)
----> 6 doc = pdf_reader.read_pdf(pdf_url)

2. Reading the file locally (success)
Downloaded pdf_url = "https://arxiv.org/pdf/1910.13461.pdf" and uploaded "1910.13461.pdf" to the "downloads" folder.

3. Question: What code should I enter to request translation of the fine-tuning section text into another language through GPT?

from IPython.core.display import display, HTML
selected_section = None
# find a section in the document by title
for section in doc.sections():
    if section.title == '3 Fine-tuning BART':
        selected_section = section
        break
HTML(selected_section.to_html(include_children=True, recurse=True))

4. custom summary of this text using a prompt: (Error)

resp = OpenAI().complete(f"read this text and answer question: {question}:\n{context}")

LocalProtocolError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/httpcore/_exceptions.py in map_exceptions(map)
      9 try:
---> 10     yield
     11 except Exception as exc:  # noqa: PIE786
I also tried asking GPT about the bug, but couldn't find a suitable fix, so I'm leaving my question here.

Add more metadata info - page label and filename?

I'm trying to use it as a pdf reader for llama index, which usually also has details like page label with each document. Anyway to add that info too? How would I go about customizing it to do that myself?

pdf problem?

I scanned a document, and the PDF allows copying and pasting the text, but I get this error with LayoutPDFReader:

     39 parser_response = self._parse_pdf(pdf_file)
     40 response_json = json.loads(parser_response.data.decode("utf-8"))
---> 41 blocks = response_json['return_dict']['result']['blocks']
     42 return Document(blocks)

KeyError: 'result'

Is it because of the format? Is there anything I could do to turn the PDF into a more readable format for your API?

llmsherpa url error

Hi! I'm trying to parse a pdf like the example:

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_url = "https://arxiv.org/pdf/1910.13461.pdf" # also allowed is a file path e.g. /home/downloads/xyz.pdf
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

But I'm receiving the error:
MaxRetryError: HTTPSConnectionPool(host='arxiv.org', port=443): Max retries exceeded with url: /pdf/1910.13461.pdf (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x000001BEE59F6410>: Failed to resolve 'arxiv.org' ([Errno 11002] getaddrinfo failed)"))

I've also tried with local files, so I think my problem is related to the API. Does anyone know how to solve this?

Add async API

It would be great if read_pdf supported an async variant so we could await the result. This would let us easily perform concurrent work during the multiple seconds the API can take to respond.
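Until an async variant exists, one workaround (a sketch) is to push the blocking call onto a worker thread with asyncio.to_thread, which lets callers await it and run other work concurrently:

```python
import asyncio

async def read_pdf_async(reader, path_or_url):
    """Run the blocking read_pdf call in a worker thread so the event
    loop stays free for other concurrent work."""
    return await asyncio.to_thread(reader.read_pdf, path_or_url)

# Usage with a real LayoutPDFReader instance:
# doc = asyncio.run(read_pdf_async(pdf_reader, "some.pdf"))
```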

Bug in API function: Repeated content in HTML when trying to convert PDF to HTML.

First of all, I would like to express my appreciation for the great work you have done converting PDF to well-tagged HTML pages. Many thanks for this contribution.

The issue I faced is that I am getting repeated pages while converting PDF to HTML.
To recreate the issue, use the following code:

Use this file to get code with indentations.
pdf_to_html_llmsherpa.txt

Actual code --

def convert_pdf_to_html(pdf_file, output_html):
    try:
        llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
        pdf_reader = LayoutPDFReader(llmsherpa_api_url)
        doc = pdf_reader.read_pdf(pdf_file)
        print(doc.to_html())

        # Write to an HTML file
        with open(output_html, 'w', encoding='utf-8') as html_file:
            html_file.write(f'{doc.to_html()}')

        print(f"Conversion successful. HTML file saved to {output_html}")
    except Exception as e:
        print(f"Error during conversion: {e}")

pdf_file_path = 'pdf_upload/AbanPearlPteLtd310322.pdf'
output_html_path = 'pdf_upload/AbanPearlPteLtd310322_modified_2.html'

convert_pdf_to_html(pdf_file_path, output_html_path)

AbanPearlPteLtd310322.pdf

Parse nodes on a para-point level

Hi,

I'm trying to parse a document which has a lot of points, which in turn have sub-points. The goal is to split the text point-wise and parse the points as llama-index nodes.
For Example,
I would like to have this as a single node:

Screenshot 2024-02-15 123256

However, when I parse and iterate through the chunks (doc.chunks()), the hierarchy for points and sub-points isn't being assigned.

All these chunks are independent and have no relationship with each other, only with the section heading:

Screenshot 2024-02-15 1232562

Based on my understanding, we can probably try the following:

  1. Manually assign the parent node (para) to the 4 sub-points (lists).
  2. Parse the document into nodes at the section level and then use sentence splitters via the LlamaIndex API (might not be optimal).

Kindly let me know if there are any alternatives for this.

Thanks!

Bug in API function: Incorrect behavior with repeated sections.

The issue arises when extracting HTML content from a document using the .to_html() method after reading a PDF with

doc = pdf_reader.read_pdf(pdf_url)
doc.to_html(include_children=True, recurse=True)

When iterating through the sections, the loop processes both the parent and child sections, causing repeated content in the HTML output and resulting in unintended duplication.

Here is the relevant code:

    def to_html(self):
        """
        Returns html for the document by iterating through all the sections
        """
        html_str = "<html>"
        for section in self.sections():
            html_str = html_str + section.to_html(include_children=True, recurse=True)
        html_str = html_str + "</html>"
        return html_str
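One possible fix, sketched below, is to render only top-level sections and let the existing recursion emit each child exactly once. This assumes each section block exposes a `level` attribute with 0 for top-level sections; verify that against your llmsherpa version before relying on it.

```python
def doc_to_html(sections):
    """Render only top-level sections; recursing into children then emits
    each nested section exactly once instead of repeating it.

    Assumes each section has a .level attribute (0 = top level) and a
    .to_html(include_children, recurse) method.
    """
    html_str = "<html>"
    for section in sections:
        if section.level == 0:  # skip children already rendered by their parent
            html_str += section.to_html(include_children=True, recurse=True)
    return html_str + "</html>"
```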

Inquiry about Open-Sourcing the Entire Project

Hello, @ansukla

I've taken a keen interest in your project and am considering its integration into our product. However, we have concerns about potential issues with the provided API and would ideally prefer to deploy it locally to ensure stability and performance.

With this in mind, I'd like to inquire if there are any plans to open-source the entire project. If so, could you provide an estimated timeline for when this might be available?

Your guidance on this would be greatly appreciated.

Thank you for your time and consideration!

Best regards,

Feature Request - API Call Parameters to set chunk minimum and maximum length.

Hello, looks great so far. Would appreciate the ability to include parameters in the API call that specify both a minimum and maximum character length for the resulting chunks.

  • Minimum chunk size would look across the resulting chunk objects and do simple concatenation until they are over some value for total text length.
  • Maximum chunk size would split chunk text into multiple segments while preserving the title/section smart labeling you already do.
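The minimum-size half of this request could be implemented today as a post-processing pass over the chunk texts. The sketch below greedily concatenates consecutive chunks; `merge_small_chunks` is a hypothetical helper, not part of the llmsherpa API.

```python
def merge_small_chunks(texts, min_chars=512):
    """Greedily concatenate consecutive chunk texts until each merged
    chunk reaches min_chars; the final chunk may remain shorter."""
    merged, buffer = [], ""
    for text in texts:
        buffer = f"{buffer}\n{text}" if buffer else text
        if len(buffer) >= min_chars:
            merged.append(buffer)
            buffer = ""
    if buffer:
        merged.append(buffer)  # leftover tail, possibly under min_chars
    return merged
```

Maximum-size splitting is harder to do naively because it must preserve the section labeling, which is why server-side support would be preferable.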

Feature Request - Splitting Bounding Boxes Across Pages

I am not sure if this is being handled. During my testing I found that the box coordinates do not account for chunks that cross a page boundary.

I am writing to propose a feature enhancement for your project, specifically regarding the current handling of bounding boxes (bbox) in the context of PDF generation.

  1. Currently, when parsing a PDF, a single bbox is produced for a chunk of text that spans a page boundary. While this approach is effective, I would like to suggest an enhancement: splitting the bbox across the pages involved, providing more granularity in representing the layout of text across pages.

  2. The second feature is to include each page's width and height (mediabox, cropbox, or rect) in the API response. This would make the bbox much more usable for adding an annotation/highlight layer.

bbox intersection

when trying to load multiple documents with joblib, get error cannot pickle

I am trying to parallelize ingestion of multiple, locally-stored PDFs, in my vectorstore.

When loading multiple documents with joblib, I get a "cannot pickle" error:

PicklingError: Could not pickle the task to send it to the workers.

Is this because the API call accesses an external server for every PDF I load with llmsherpa?
What would be a workaround for this? Making it async (and if so, how)?

I think this is important for production.

thank you
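A common workaround, since the reader holds an unpicklable HTTP connection pool: construct the reader inside each worker instead of shipping it across process boundaries, or use threads, which skip pickling entirely and suit this I/O-bound work. Below is a sketch with stdlib `concurrent.futures`; `parse_one`, `parse_all`, and the factory argument are illustrative (in real use the factory would be `LayoutPDFReader`).

```python
from concurrent.futures import ThreadPoolExecutor

API_URL = "http://localhost:5001/api/parseDocument?renderFormat=all"  # your server

def parse_one(path, reader_factory):
    """Build the (unpicklable) reader inside the worker rather than
    passing a shared instance in from the parent."""
    reader = reader_factory(API_URL)
    return reader.read_pdf(path)

def parse_all(paths, reader_factory, workers=4):
    # Threads avoid pickling entirely; the work is I/O-bound (HTTP calls),
    # so the GIL is not a bottleneck here. Results keep input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: parse_one(p, reader_factory), paths))
```

The same create-inside-the-worker pattern also fixes joblib's process-based backends, since only the path strings then need to be pickled.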

Bug in load_data when using full path

This code would fail:

full_path = 'C:\\temp\\A\\test.pdf'
documents = pdf_loader.load_data(full_path )

However, if a relative path is given it works fine.

It looks like the issue is in file_reader.py:63
is_url = urlparse(path_or_url).scheme != ""

In the case of a full Windows path, the scheme will be the drive letter ('c' in this case), which makes the loader treat the path as a URL instead of a file path.
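A possible patch (a sketch of the idea, not the project's official fix) is to accept only real URL schemes, so single-letter Windows drive schemes fall through to file handling:

```python
from urllib.parse import urlparse

def is_url(path_or_url):
    """Treat the string as a URL only for known schemes. A Windows path
    like 'C:\\temp\\test.pdf' parses with scheme 'c', which is a drive
    letter, not a URL scheme."""
    scheme = urlparse(path_or_url).scheme
    return scheme in ("http", "https", "ftp")
```

An alternative with the same effect is checking `os.path.exists(path_or_url)` first and only falling back to URL handling when no local file matches.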

Missing Urllib3 Dependency

Still experimenting with the library, but it looks great so far. It looks like urllib3 is missing from the required dependencies. I opened a PR #1 to add this into setup.py

Not able to get all the subsection names inside a section

Hi, I am using the attached PDF for testing. There is no whitespace between the subsection title and the subsection content, and the library is not able to extract all the subsection titles present within a section. I tried with a different PDF where whitespace is present, and it worked pretty well. Could you please advise how to extract a specific subsection title along with its corresponding content?
RWXcE3.pdf.pdf

list of pdfs as input

pdf_url = "https://arxiv.org/pdf/1910.13461.pdf"

Does read_pdf support a list of PDFs, or only a single PDF?
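read_pdf takes one path or URL per call; a list can simply be looped over. `read_pdfs` below is a hypothetical convenience wrapper, not part of the API:

```python
def read_pdfs(reader, paths_or_urls):
    """read_pdf accepts a single path or URL; to process several PDFs,
    map it over the list and collect the resulting Document objects."""
    return [reader.read_pdf(p) for p in paths_or_urls]
```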

Failed to establish a new connection: [Errno 111] Connection refused'))

I used the test_llmsherpa_api.ipynb file but got a connection error.

WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x78a6541b8910>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/parseDocument?renderFormat=all&useNewIndentParser=true&applyOcr=yes
WARNING:urllib3.connectionpool:Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x78a6541b8a90>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/parseDocument?renderFormat=all&useNewIndentParser=true&applyOcr=yes
WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x78a6541b8c70>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/parseDocument?renderFormat=all&useNewIndentParser=true&applyOcr=yes
---------------------------------------------------------------------------
ConnectionRefusedError                    Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/urllib3/connection.py](https://localhost:8080/#) in _new_conn(self)
    202         try:
--> 203             sock = connection.create_connection(
    204                 (self._dns_host, self.port),

20 frames
ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

NewConnectionError                        Traceback (most recent call last)
NewConnectionError: <urllib3.connection.HTTPConnection object at 0x78a6541b8e20>: Failed to establish a new connection: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

MaxRetryError                             Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/urllib3/util/retry.py](https://localhost:8080/#) in increment(self, method, url, response, error, _pool, _stacktrace)
    513         if new_retry.is_exhausted():
    514             reason = error or ResponseError(cause)
--> 515             raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
    516 
    517         log.debug("Incremented Retry for (url='%s'): %r", url, new_retry)

MaxRetryError: HTTPConnectionPool(host='localhost', port=5001): Max retries exceeded with url: /api/parseDocument?renderFormat=all&useNewIndentParser=true&applyOcr=yes (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x78a6541b8e20>: Failed to establish a new connection: [Errno 111] Connection refused'))

just want to get the paragraph information

The project's ability to extract paragraph information from PDF files is exciting. If I just want to get the paragraph information inside the PDF without using an LLM, can I do it without calling the API? Or can you recommend some related technologies and repositories?

keyerror: result

Running test script in colab:

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_url = "dagpenge_LH_merged.pdf" # also allowed is a file path e.g. /home/downloads/xyz.pdf
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

returns

KeyError Traceback (most recent call last)
in <cell line: 6>()
4 pdf_url = "dagpenge_LH_merged.pdf" # also allowed is a file path e.g. /home/downloads/xyz.pdf
5 pdf_reader = LayoutPDFReader(llmsherpa_api_url)
----> 6 doc = pdf_reader.read_pdf(pdf_url)

/usr/local/lib/python3.10/dist-packages/llmsherpa/readers/file_reader.py in read_pdf(self, path_or_url)
39 parser_response = self._parse_pdf(pdf_file)
40 response_json = json.loads(parser_response.data.decode("utf-8"))
---> 41 blocks = response_json['return_dict']['result']['blocks']
42 return Document(blocks)
43 # def read_file(file_path):

KeyError: 'result'

I often get this error when trying to run the demo script. It also occurred yesterday, but then running it a few times "solved" the issue; that no longer works.
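On failures the server appears to return a payload without the 'result' key. A defensive wrapper (a sketch; the key names are taken from the traceback above, and `extract_blocks` is a hypothetical helper) would surface the server's actual response instead of a bare KeyError:

```python
import json

def extract_blocks(response_bytes):
    """Return parsed blocks, or raise a readable error when the server
    reports a failure instead of a result."""
    payload = json.loads(response_bytes.decode("utf-8"))
    result = payload.get("return_dict", {}).get("result")
    if result is None or "blocks" not in result:
        # Surface whatever the server sent back instead of KeyError: 'result'
        raise ValueError(f"parser returned no blocks: {payload}")
    return result["blocks"]
```

Seeing the full error payload usually reveals whether the server rejected the file, timed out, or hit an OCR/parsing problem.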

Parsing bytes-like objects directly

Hi,

I've been experimenting with llmsherpa with a small streamlit app to which I upload PDFs, and I saw that I can't use the uploaded files (which are bytes-like) directly.

I have code to handle that directly, though I wonder if this feature is needed by anyone else... If there's interest I can open a PR.
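Until bytes-like input is supported directly, a temporary-file shim works with the existing path-based API; `read_pdf_bytes` is a hypothetical helper, and this assumes read_pdf accepts a local file path (which it does per the examples above):

```python
import os
import tempfile

def read_pdf_bytes(reader, data, suffix=".pdf"):
    """Write uploaded bytes (e.g. from a Streamlit file uploader) to a
    temporary file, hand the path to read_pdf, and clean up afterwards."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        return reader.read_pdf(path)
    finally:
        os.remove(path)
```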
