
llmsherpa's Introduction

LLM Sherpa

LLM Sherpa provides strategic APIs to accelerate large language model (LLM) use cases.

What's New

Important

The llmsherpa backend service is now fully open sourced under the Apache 2.0 License. See https://github.com/nlmatics/nlm-ingestor

  • You can now run your own servers using a docker image!
  • Support for different file formats: DOCX, PPTX, HTML, TXT, XML
  • OCR Support is built in
  • Blocks now have coordinates - use the bbox property of blocks such as sections
  • A new indent parser to better align all headings in a document to their corresponding level
  • The free server and paid server are no longer updated with the latest code; users are requested to spawn their own servers using the instructions in nlm-ingestor
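Spawning your own server from the open-sourced Docker image could look like the following (a sketch; the image name and port are taken from community reports elsewhere on this page, so check the nlm-ingestor README for the authoritative instructions):

```shell
# Pull and start the open-sourced nlm-ingestor parser server.
docker pull ghcr.io/nlmatics/nlm-ingestor:latest
docker run -p 5010:5010 ghcr.io/nlmatics/nlm-ingestor:latest

# Then point LayoutPDFReader at your own instance instead of the free server:
# llmsherpa_api_url = "http://localhost:5010/api/parseDocument?renderFormat=all"
```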

LayoutPDFReader

Most PDF to text parsers do not provide layout information. Often, even sentences are split with arbitrary CR/LFs, making it very difficult to find paragraph boundaries. This poses various challenges in chunking and in adding long-running contextual information, such as section headers, to the passages while indexing/vectorizing PDFs for LLM applications such as retrieval augmented generation (RAG).

LayoutPDFReader solves this problem by parsing PDFs along with hierarchical layout information such as:

  1. Sections and subsections along with their levels.
  2. Paragraphs, with split lines combined.
  3. Links between sections and paragraphs.
  4. Tables along with the section the tables are found in.
  5. Lists and nested lists.
  6. Joining of content spread across pages.
  7. Removal of repeating headers and footers.
  8. Watermark removal.

With LayoutPDFReader, developers can find optimal chunks of text to vectorize, and a solution for limited context window sizes of LLMs.
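To make the header/footer removal above concrete, here is a toy sketch (not the library's actual algorithm) of how lines that repeat across pages can be detected and stripped:

```python
from collections import Counter

def strip_repeating_lines(pages, threshold=0.8):
    """Drop lines that repeat on most pages (likely headers/footers).

    pages: list of per-page text strings; threshold: fraction of pages
    a line must appear on to be considered boilerplate.
    """
    # Count on how many pages each distinct line occurs.
    counts = Counter(line for page in pages
                     for line in set(page.splitlines()))
    cutoff = threshold * len(pages)
    boilerplate = {line for line, n in counts.items() if n >= cutoff}
    return ["\n".join(line for line in page.splitlines()
                      if line not in boilerplate)
            for page in pages]

pages = ["ACME Corp\nIntro text\nPage 1",
         "ACME Corp\nMore text\nPage 2",
         "ACME Corp\nEnd text\nPage 3"]
print(strip_repeating_lines(pages))
```

The page numbers survive because they differ on every page; only the identical "ACME Corp" header is removed.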

You can experiment with the library directly in Google Colab here

Here's a writeup explaining the problem and our approach.

Here's a LlamaIndex blog explaining the need for smart chunking.

API Reference: https://llmsherpa.readthedocs.io/

  • How to use with Google Gemini Pro
  • How to use with Cohere Embed3

Important Notes

  • The LayoutPDFReader is tested on a wide variety of PDFs. That being said, it is still challenging to get every PDF parsed correctly.
  • OCR is not supported on the free server; only PDFs with a text layer are supported. (The self-hosted nlm-ingestor server has OCR support built in.)

Note

LLMSherpa uses a free and open API server. The server does not store your PDFs except for temporary storage during parsing. This server will be decommissioned soon. Self-host your own private server using the instructions at https://github.com/nlmatics/nlm-ingestor

Important

The private instance available on Microsoft Azure Marketplace will be decommissioned soon. Please move to a self-hosted instance using the instructions at https://github.com/nlmatics/nlm-ingestor.

Installation

pip install llmsherpa

Read a PDF file

The first step in using the LayoutPDFReader is to provide a URL or file path to it and get back a document object.

from llmsherpa.readers import LayoutPDFReader

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_url = "https://arxiv.org/pdf/1910.13461.pdf" # also allowed is a file path e.g. /home/downloads/xyz.pdf
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

Install LlamaIndex

In the following examples, we will use LlamaIndex for simplicity. Install the library if you haven't already.

pip install llama-index

Setup OpenAI

import openai
openai.api_key = "<Insert API Key>"

Vector search and Retrieval Augmented Generation with Smart Chunking

LayoutPDFReader does smart chunking, keeping text that the document structure marks as related together in the same chunk:

  • All list items are kept together, including the paragraph that precedes the list.
  • Items in a table are chunked together.
  • Contextual information from section headers and nested section headers is included.
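Conceptually, to_context_text() prefixes each chunk with its section-header trail; a toy illustration (hypothetical helper, not the library code):

```python
def with_context(header_path, chunk_text):
    """Toy sketch: prepend the section-header trail to a chunk so the
    vectorized text carries its surrounding document context."""
    return " > ".join(header_path) + "\n" + chunk_text

ctx = with_context(["3 Fine-tuning BART", "3.4 Machine Translation"],
                   "We also explore using BART to improve MT decoders.")
print(ctx)
```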

The following code creates a LlamaIndex query engine from LayoutPDFReader document chunks:

from llama_index.core import Document
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex([])
for chunk in doc.chunks():
    index.insert(Document(text=chunk.to_context_text(), extra_info={}))
query_engine = index.as_query_engine()

Let's run one query:

response = query_engine.query("list all the tasks that work with bart")
print(response)

We get the following response:

BART works well for text generation, comprehension tasks, abstractive dialogue, question answering, and summarization tasks.

Let's try another query that needs answer from a table:

response = query_engine.query("what is the bart performance score on squad")
print(response)

Here's the response we get:

The BART performance score on SQuAD is 88.8 for EM and 94.6 for F1.

Summarize a Section using prompts

LayoutPDFReader offers powerful ways to pick sections and subsections from a large document and use LLMs to extract insights from a section.

The following code looks for the Fine-tuning section of the document:

from IPython.core.display import display, HTML
selected_section = None
# find a section in the document by title
for section in doc.sections():
    if section.title == '3 Fine-tuning BART':
        selected_section = section
        break
# use include_children=True and recurse=True to fully expand the section. 
# include_children only returns at one sublevel of children whereas recurse goes through all the descendants
HTML(selected_section.to_html(include_children=True, recurse=True))

Running the above code yields the following HTML output:

3 Fine-tuning BART

The representations produced by BART can be used in several ways for downstream applications.

3.1 Sequence Classification Tasks

For sequence classification tasks, the same input is fed into the encoder and decoder, and the final hidden state of the final decoder token is fed into new multi-class linear classifier. This approach is related to the CLS token in BERT; however we add the additional token to the end so that representation for the token in the decoder can attend to decoder states from the complete input (Figure 3a).

3.2 Token Classification Tasks

For token classification tasks, such as answer endpoint classification for SQuAD, we feed the complete document into the encoder and decoder, and use the top hidden state of the decoder as a representation for each word. This representation is used to classify the token.

3.3 Sequence Generation Tasks

Because BART has an autoregressive decoder, it can be directly fine tuned for sequence generation tasks such as abstractive question answering and summarization. In both of these tasks, information is copied from the input but manipulated, which is closely related to the denoising pre-training objective. Here, the encoder input is the input sequence, and the decoder generates outputs autoregressively.

3.4 Machine Translation

We also explore using BART to improve machine translation decoders for translating into English. Previous work Edunov et al. (2019) has shown that models can be improved by incorporating pre-trained encoders, but gains from using pre-trained language models in decoders have been limited. We show that it is possible to use the entire BART model (both encoder and decoder) as a single pretrained decoder for machine translation, by adding a new set of encoder parameters that are learned from bitext (see Figure 3b).

More precisely, we replace BART’s encoder embedding layer with a new randomly initialized encoder. The model is trained end-to-end, which trains the new encoder to map foreign words into an input that BART can de-noise to English. The new encoder can use a separate vocabulary from the original BART model.

We train the source encoder in two steps, in both cases backpropagating the cross-entropy loss from the output of the BART model. In the first step, we freeze most of BART parameters and only update the randomly initialized source encoder, the BART positional embeddings, and the self-attention input projection matrix of BART’s encoder first layer. In the second step, we train all model parameters for a small number of iterations.

Now, let's create a custom summary of this text using a prompt:

from llama_index.llms import OpenAI
context = selected_section.to_html(include_children=True, recurse=True)
question = "list all the tasks discussed and one line about each task"
resp = OpenAI().complete(f"read this text and answer question: {question}:\n{context}")
print(resp.text)

The above code results in following output:

Tasks discussed in the text:

1. Sequence Classification Tasks: The same input is fed into the encoder and decoder, and the final hidden state of the final decoder token is used for multi-class linear classification.
2. Token Classification Tasks: The complete document is fed into the encoder and decoder, and the top hidden state of the decoder is used as a representation for each word for token classification.
3. Sequence Generation Tasks: BART can be fine-tuned for tasks like abstractive question answering and summarization, where the encoder input is the input sequence and the decoder generates outputs autoregressively.
4. Machine Translation: BART can be used to improve machine translation decoders by incorporating pre-trained encoders and using the entire BART model as a single pretrained decoder. The new encoder parameters are learned from bitext.

Analyze a Table using prompts

With LayoutPDFReader, you can iterate through all the tables in a document and use the power of LLMs to analyze a table. Let's look at the 6th table in this document. If you are using a notebook, you can display the table as follows:

from IPython.core.display import display, HTML
HTML(doc.tables()[5].to_html())

The output table structure looks like this:

| Model | SQuAD 1.1 EM/F1 | SQuAD 2.0 EM/F1 | MNLI m/mm | SST Acc | QQP Acc | QNLI Acc | STS-B Acc | RTE Acc | MRPC Acc | CoLA Mcc |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT | 84.1/90.9 | 79.0/81.8 | 86.6/- | 93.2 | 91.3 | 92.3 | 90.0 | 70.4 | 88.0 | 60.6 |
| UniLM | -/- | 80.5/83.4 | 87.0/85.9 | 94.5 | - | 92.7 | - | 70.9 | - | 61.1 |
| XLNet | 89.0/94.5 | 86.1/88.8 | 89.8/- | 95.6 | 91.8 | 93.9 | 91.8 | 83.8 | 89.2 | 63.6 |
| RoBERTa | 88.9/94.6 | 86.5/89.4 | 90.2/90.2 | 96.4 | 92.2 | 94.7 | 92.4 | 86.6 | 90.9 | 68.0 |
| BART | 88.8/94.6 | 86.1/89.2 | 89.9/90.1 | 96.6 | 92.5 | 94.9 | 91.2 | 87.0 | 90.4 | 62.8 |

Now let's ask a question to analyze this table:

from llama_index.llms import OpenAI
context = doc.tables()[5].to_html()
resp = OpenAI().complete(f"read this table and answer question: which model has the best performance on squad 2.0:\n{context}")
print(resp.text)

The above question will result in the following output:

The model with the best performance on SQuAD 2.0 is RoBERTa, with an EM/F1 score of 86.5/89.4.

That's it! LayoutPDFReader also supports tables with nested headers and header rows.

Here's an example with nested headers:

from IPython.core.display import display, HTML
HTML(doc.tables()[6].to_html())
| Model | CNN/DailyMail R1 | CNN/DailyMail R2 | CNN/DailyMail RL | XSum R1 | XSum R2 | XSum RL |
|---|---|---|---|---|---|---|
| Lead-3 | 40.42 | 17.62 | 36.67 | 16.30 | 1.60 | 11.95 |
| PTGEN (See et al., 2017) | 36.44 | 15.66 | 33.42 | 29.70 | 9.21 | 23.24 |
| PTGEN+COV (See et al., 2017) | 39.53 | 17.28 | 36.38 | 28.10 | 8.02 | 21.72 |
| UniLM | 43.33 | 20.21 | 40.51 | - | - | - |
| BERTSUMABS (Liu & Lapata, 2019) | 41.72 | 19.39 | 38.76 | 38.76 | 16.33 | 31.15 |
| BERTSUMEXTABS (Liu & Lapata, 2019) | 42.13 | 19.60 | 39.18 | 38.81 | 16.50 | 31.27 |
| BART | 44.16 | 21.28 | 40.90 | 45.14 | 22.27 | 37.25 |

Now let's ask an interesting question:

from llama_index.llms import OpenAI
context = doc.tables()[6].to_html()
question = "tell me about R1 of bart for different datasets"
resp = OpenAI().complete(f"read this table and answer question: {question}:\n{context}")
print(resp.text)

And we get the following answer:

R1 of BART for different datasets:

- For the CNN/DailyMail dataset, the R1 score of BART is 44.16.
- For the XSum dataset, the R1 score of BART is 45.14.

Get the Raw JSON

To get the complete JSON returned by the llmsherpa service and process it differently, simply use the json attribute:

doc.json
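As a sketch of what you might do with the raw JSON (the block fields shown here, tag and sentences, are assumptions modeled on nlm-ingestor output; inspect doc.json from your own server before relying on them):

```python
# Hypothetical block list mimicking the shape of doc.json; the exact
# field names may differ, so check the output of your own server.
blocks = [
    {"tag": "header", "level": 0, "sentences": ["3 Fine-tuning BART"]},
    {"tag": "para", "level": 1,
     "sentences": ["The representations produced by BART can be reused."]},
]

# Walk the raw blocks and render a flat "tag: text" outline.
rendered = [f"{b['tag']}: {' '.join(b.get('sentences', []))}" for b in blocks]
print(rendered)
```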

llmsherpa's People

Contributors

aaryantr, ansukla, ioanadragan, jsv4, kiran-nlmatics, lebigot, moshewe, sonalshad


llmsherpa's Issues

Can I get coordinates for each chunk?

First of all, thank you for implementing such a wonderful library.

I tried using it and got some good results.

I thought it would be better if there was coordinate information for each chunk, but do you have any plans to implement it in the future?

I think it would be technically difficult to obtain data that includes coordinate information...

How to add custom parser API URL

Looking at the documentation, to parse a pdf one has to pass a parser url which is basically an api that does all the magic (chunks, sections, paragraphs etc.).
I was wondering where this api is hosted and if ever we can self host this api to use in a local environment.

In the doc here it is written the following : Use customer url for your private instance here

Langchain integration

Hi, thanks for this amazing lib. Any plans for Langchain or OpenAI Assistants API integration?

Source Citations

Trying to sort out how to best set up llmsherpa to cooperate with a llama-index Query Engine system where I need to retrieve source citations. I was having a hard time when using it as a loader, wondering if there's a way to implement it as a parser/sentence splitter?

Would appreciate any help you can offer - the results after swapping out PDFMiner were undeniable - instant 80% boost in accuracy and understanding.

error loading a document

When I load a document of about 800 pages, the connection times out. Is this common for huge files?

How can I check about 'block_class'?

First, I want to say thanks for your great work. I was surprised that this API can recognize Korean. Very impressive.

Everything is perfect except a specific form of table. I want to remove this type of table from the LayoutPDFReader result, but I cannot isolate the tables that LayoutPDFReader does not recognize properly.

However, I noticed that I can detect them with 'block_class', so I want to know what 'block_class' is and how to check it.

Thanks, and sorry for my poor English.

LayoutPDFReader._parse_pdf returns error when pdf contains empty pages

I tried processing a PDF file using the LayoutPDFReader.read_pdf() method, but got a KeyError for response_json['return_dict']['result']['blocks'], since the response did not contain results because there was an error (on a side note: it would be nice to have a specific error in this case instead of a KeyError, clearly stating that the file could not be processed and why).

I split my pdf in pages and processed each page separately to understand what the issue was. Turns out that the error existed every time an empty page was being processed. I am not sure whether this is the case for empty pages of all types of pdfs or just for some pdf types (there are small differences between text pdfs depending on how they were created). It only occurred on one of the pdfs I was processing, but it was also the only pdf with empty pages...

Better: do not fail processing of a whole document if it has one empty page, but simply skip that page.
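One defensive pattern (a sketch, not part of the library) is to validate the parser response before indexing into it, turning the opaque KeyError into an actionable message:

```python
import json

def extract_blocks(response_text):
    """Return the blocks list from a parser response, raising a clear
    error instead of a bare KeyError when the server reports a failure."""
    data = json.loads(response_text)
    try:
        return data["return_dict"]["result"]["blocks"]
    except KeyError as exc:
        # The shape of error responses is a guess; adapt this to what
        # your server actually returns.
        detail = data.get("return_dict", data)
        raise ValueError(f"parser did not return blocks: {detail}") from exc

ok = '{"return_dict": {"result": {"blocks": [{"tag": "para"}]}}}'
print(extract_blocks(ok))
```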

KeyError: result

Hi, I am facing the same error, I have 5 PDF files. The code works for 4 except one, I tried running it a few times like @jalkestrup but it still throws the error:
KeyError: 'result'
I would really appreciate any support from authors/community on this!

P.S. Since the issue wasn't resolved hence reopening the issue :)

Azure MarketPlace cannot provide stable service

We have recently deployed the LLMSherpa service through Azure Marketplace and are experiencing an intermittent issue. Specifically, the API suddenly stops responding to any requests, failing to return results. This issue persists even after restarting both the Azure machine and the service itself. However, we have noticed that the service spontaneously resumes normal operation the following day. Additionally, we are unable to log into the deployed machine to check if the service is functioning normally, and we cannot access any logs.

Has anyone else encountered similar issues with services deployed via Azure Marketplace? If so, could you please share any insights or solutions to this problem? Additionally, we are interested in knowing if there are alternative, more stable API services available that we could consider.

Thank you for your assistance.

PDF size limit - apologies if I caused problems

Apologies if I caused problems just recently. I uploaded a huge, complicated PDF file (over 200 MB) to test the module. Is there a limit on file size we need to observe to avoid causing issues?

read_pdf fails on specific pdf locally, not through hosted api

PDF in question:
JTR.pdf

This api call works great
llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_url = "JTR.pdf"
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

This local call fails
llmsherpa_api_url = "http://localhost:5010/api/parseDocument?renderFormat=all"
pdf_url = "JTR.pdf"
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

The local version is running from the latest docker build. Other pdfs work fine. Is there a way to get a better error message? Currently receiving: KeyError: 'return_dict'

I noticed there are other issues open around this error but did not find any matching this case where it works on one and not the other.

I appreciate your time and any insight. Thanks!

Nodes and llama index

When using the llama index example:
from llama_index.readers.schema.base import Document
from llama_index import VectorStoreIndex

index = VectorStoreIndex([])
for chunk in doc.chunks():
    index.insert(Document(text=chunk.to_context_text(), extra_info={}))

Are the chunks converted to nodes? I'm trying to use Pinecone, but it requires documents or nodes for ingestion, and this happens after the index is created.

JSONDecodeError

self.pdf_reader = readers.LayoutPDFReader(
    self.api_url,
)
document = self.pdf_reader.read_pdf(final_location)

returns

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 21, in read_document
File "/Users/joncheng/opt/miniconda3/envs/owlbear/lib/python3.10/site-packages/llmsherpa/readers/file_reader.py", line 72, in read_pdf
response_json = json.loads(parser_response.data.decode("utf-8"))
File "/Users/joncheng/opt/miniconda3/envs/owlbear/lib/python3.10/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/Users/joncheng/opt/miniconda3/envs/owlbear/lib/python3.10/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/Users/joncheng/opt/miniconda3/envs/owlbear/lib/python3.10/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Looks like there's an issue with handling jsondecoding errors.

I just provided a local file path.

Skips first few lines in PDF.

I'm running the server in docker:

image: ghcr.io/nlmatics/nlm-ingestor:latest

I've only tested with one 300-page PDF, and it seems to skip the first couple of lines of the PDF. It doesn't seem to be a big issue, but it makes me wonder if anything else is being skipped. This is the same whether I convert to text, use sections, or convert to HTML.

What might be the cause?

Timeout Error

When I try to load a PDF that is 541 pages long (~9.5 MB), I get the following error message:

ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

I assume it's due to the large file size? I don't have the same issue loading smaller files.

InternalServerError: Error code: 503

from llama_index.readers.schema.base import Document
from llama_index import VectorStoreIndex

index = VectorStoreIndex([])
for chunk in doc.chunks():
    index.insert(Document(text=chunk.to_context_text(), extra_info={}))
query_engine = index.as_query_engine()

InternalServerError Traceback (most recent call last)
Input In [17], in <cell line: 5>()
4 index = VectorStoreIndex([])
5 for chunk in doc.chunks():
----> 6 index.insert(Document(text=chunk.to_context_text(), extra_info={'embed_model':'text-embedding-V2'}))
7 query_engine = index.as_query_engine()

InternalServerError: Error code: 503 - {'error': {'message': 'There are no available channels for model text embedding ada-002 under the current group VIP (request id: 20240108173434543639222FMD9UnZh)', 'type': 'new_api_error'}}

Question: How to define parameter change models?

Getting a json parse exception when trying to use the `content=` parameter of LayoutPDFReader.read_pdf() with a None value.

Steps taken:

from llmsherpa.readers import LayoutPDFReader
from pathlib import Path

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
parser= LayoutPDFReader(llmsherpa_api_url)
path = Path('yp') / 'tests' / 'content' / 'Ambrx EX-2.1.pdf'
with open(path, 'rb') as f:
    content = f.read()
parser.read_pdf(None, content)

resulting stack trace:

Traceback (most recent call last):
  File "/home/mboyd/.pycharm_helpers/pydev/pydevconsole.py", line 364, in runcode
    coro = func()
           ^^^^^^
  File "<input>", line 1, in <module>
  File "/home/mboyd/.virtualenvs/yp-demo/lib/python3.12/site-packages/llmsherpa/readers/file_reader.py", line 72, in read_pdf
    response_json = json.loads(parser_response.data.decode("utf-8"))
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mboyd/.pyenv/versions/3.12.1/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mboyd/.pyenv/versions/3.12.1/lib/python3.12/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mboyd/.pyenv/versions/3.12.1/lib/python3.12/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

However, parser.read_pdf('', contents=content) DOES successfully parse, since an empty string evaluates to false and cleanly converts to valid JSON in _parse_pdf(), unlike None. None, however, would be the normal Pythonic way of specifying no value.
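Until the library normalizes this itself, a thin wrapper (hypothetical helper; the contents= keyword follows the usage above) can coerce None to the empty string before calling read_pdf:

```python
def read_pdf_safe(reader, path_or_url, contents=None):
    """Coerce None to '' before delegating, so passing raw bytes with
    no path works the same as passing an empty string."""
    return reader.read_pdf(path_or_url or "", contents=contents)
```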

ImportError: cannot import name 'AsyncAzureOpenAI' from 'openai'

This error was generated using your code example.

Traceback (most recent call last):
File "C:\Users\x\wc-chat-pdf-py-willis\layout-pdf-reader.py", line 2, in <module>
from llama_index.readers.schema.base import Document
File "C:\Users\x\AppData\Local\Programs\Python\Python311\Lib\site-packages\llama_index\__init__.py", line 17, in <module>
from llama_index.embeddings.langchain import LangchainEmbedding
File "C:\Users\x\AppData\Local\Programs\Python\Python311\Lib\site-packages\llama_index\embeddings\__init__.py", line 7, in <module>
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
File "C:\Users\x\AppData\Local\Programs\Python\Python311\Lib\site-packages\llama_index\embeddings\azure_openai.py", line 3, in <module>
from openai import AsyncAzureOpenAI, AzureOpenAI
ImportError: cannot import name 'AsyncAzureOpenAI' from 'openai' (C:\Users\x\AppData\Local\Programs\Python\Python311\Lib\site-packages\openai\__init__.py)

timeout error?

I am getting this error while reading a local PDF (quite a big one, though). FWIW, the Unstructured API managed it well.


File /opt/homebrew/lib/python3.11/site-packages/llmsherpa/readers/file_reader.py:41, in LayoutPDFReader.read_pdf(self, path_or_url)
     39 parser_response = self._parse_pdf(pdf_file)
     40 response_json = json.loads(parser_response.data.decode("utf-8"))
---> 41 blocks = response_json['return_dict']['result']['blocks']
     42 return Document(blocks)

KeyError: 'result'

Connection Error

I am using the latest LLMSherpa to chunk a PDF but I always get this SSLCertVerificationError. I am using Python 3.12 and simple code:

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

It looks like this is a known issue that could be resolved by disabling the SSL check, but I could not find any way to handle it, as the connections are made by LayoutPDFReader with no handle to disable SSL verification. Please guide me.

Error Details:
SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000)

MaxRetryError: HTTPSConnectionPool(host='readers.llmsherpa.com', port=443): Max retries exceeded with url: /api/document/developer/parseDocument?renderFormat=all (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000)')))

APIConnectionError: Connection error.


LocalProtocolError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/httpcore/_exceptions.py in map_exceptions(map)
9 try:
---> 10 yield
11 except Exception as exc: # noqa: PIE786

106 frames
LocalProtocolError: Illegal header value b'Bearer '

(the same LocalProtocolError is chained several more times)

APIConnectionError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/openai/_base_client.py in _request(self, cast_to, options, remaining_retries, stream, stream_cls)
903 )
904
--> 905 raise APIConnectionError(request=request) from err
906
907 log.debug(

APIConnectionError: Connection error.

LayoutPDFReader_Demo.ipynb test Error confirmation request and fine-tuning section GPT multilingual translation request code

I am reporting a malfunction while testing based on LayoutPDFReader_Demo.ipynb.

1. PDF download from an external URL fails

pdf_url = "https://arxiv.org/pdf/1910.13461.pdf"

UnboundLocalError Traceback (most recent call last)
in <cell line: 6>()
      4 pdf_url = "https://arxiv.org/pdf/1910.13461.pdf" # also allowed is a file path e.g. /home/downloads/xyz.pdf
      5 pdf_reader = LayoutPDFReader(llmsherpa_api_url)
----> 6 doc = pdf_reader.read_pdf(pdf_url)

2. Reading the file locally (success)
Downloaded pdf_url = "https://arxiv.org/pdf/1910.13461.pdf" and uploaded "1910.13461.pdf" to the "downloads" folder.

3. Question: What code should I enter to request translation of the fine-tuning section text into another language through GPT?

from IPython.core.display import display, HTML
selected_section = None
# find a section in the document by title
for section in doc.sections():
    if section.title == '3 Fine-tuning BART':
        selected_section = section
        break
HTML(selected_section.to_html(include_children=True, recurse=True))

4. custom summary of this text using a prompt: (Error)

resp = OpenAI().complete(f"read this text and answer question: {question}:\n{context}")

LocalProtocolError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/httpcore/_exceptions.py in map_exceptions(map)
      9 try:
---> 10     yield
     11 except Exception as exc:  # noqa: PIE786
I also tried asking GPT about the bug, but couldn't find a suitable fix, so I'm leaving my question here.

Add more metadata info - page label and filename?

I'm trying to use it as a pdf reader for llama index, which usually also has details like page label with each document. Anyway to add that info too? How would I go about customizing it to do that myself?

pdf problem?

I scanned a document, and the PDF allows copying and pasting the text, but I get this error with LayoutPDFReader:

     39 parser_response = self._parse_pdf(pdf_file)
     40 response_json = json.loads(parser_response.data.decode("utf-8"))
---> 41 blocks = response_json['return_dict']['result']['blocks']
     42 return Document(blocks)

KeyError: 'result'

Is it because of the format? Is there anything I could do to turn the PDF into a more readable format for your API?

llmsherpa url error

Hi! I'm trying to parse a pdf like the example:

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_url = "https://arxiv.org/pdf/1910.13461.pdf" # also allowed is a file path e.g. /home/downloads/xyz.pdf
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

But I'm receiving the error:
MaxRetryError: HTTPSConnectionPool(host='arxiv.org', port=443): Max retries exceeded with url: /pdf/1910.13461.pdf (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x000001BEE59F6410>: Failed to resolve 'arxiv.org' ([Errno 11002] getaddrinfo failed)"))

I've also tried with local files, so I think my problem is related to the API. Does anyone know how to solve this?

Add async API

It would be great if read_pdf supported an async variant so we could await the result. This would let us easily perform concurrent work during the multiple seconds the API can take to respond.
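Until an async variant exists, one workaround (a sketch) is to push the blocking call onto a worker thread with asyncio.to_thread, which lets callers await it and run other work concurrently:

```python
import asyncio

async def read_pdf_async(reader, path_or_url):
    """Run the blocking read_pdf call in a worker thread so the event
    loop stays free for other concurrent work."""
    return await asyncio.to_thread(reader.read_pdf, path_or_url)

# Usage with a real LayoutPDFReader instance:
# doc = asyncio.run(read_pdf_async(pdf_reader, "some.pdf"))
```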

Bug in API function: Repeated content in HTML when trying to convert PDF to HTML.

First of all, I would like to express my appreciation for the great work you have done converting PDF to well-tagged HTML pages. Many thanks for this contribution.

The issue I faced is that I am getting repeated pages while converting PDF to HTML.
To recreate the issue, use the following code:

Use this file to get code with indentations.
pdf_to_html_llmsherpa.txt

Actual code --

def convert_pdf_to_html(pdf_file, output_html):
    try:
        llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
        pdf_reader = LayoutPDFReader(llmsherpa_api_url)
        doc = pdf_reader.read_pdf(pdf_file)
        print(doc.to_html())

        # Write to an HTML file
        with open(output_html, 'w', encoding='utf-8') as html_file:
            html_file.write(f'{doc.to_html()}')

        print(f"Conversion successful. HTML file saved to {output_html}")
    except Exception as e:
        print(f"Error during conversion: {e}")

pdf_file_path = 'pdf_upload/AbanPearlPteLtd310322.pdf'
output_html_path = 'pdf_upload/AbanPearlPteLtd310322_modified_2.html'

convert_pdf_to_html(pdf_file_path, output_html_path)

AbanPearlPteLtd310322.pdf

Parse nodes on a para-point level

Hi,

I'm trying to parse a document which has a lot of points, which in turn have sub-points. The goal is to split the text point-wise and parse the points as llama-index nodes.
For Example,
I would like to have this as a single node:

Screenshot 2024-02-15 123256

However, when I parse and iterate through the chunks (doc.chunks()), the hierarchy for points and sub-points isn't being assigned.

All these chunks are independent and have no relationship with each other, only with the section heading:

Screenshot 2024-02-15 1232562

Based on my understanding, we can probably try the following:

  1. Manually assign the parent node (para) to the 4 sub-points (lists).
  2. Parse the document into nodes at the section level and then use sentence splitters via the LlamaIndex API (might not be optimal).

Kindly let me know if there are any alternatives for this.

Thanks!

Bug in API function: Incorrect behavior with repeated sections.

The issue arises when extracting HTML content from a document using the .to_html() method after reading a PDF with

doc = pdf_reader.read_pdf(pdf_url)
doc.to_html(include_children=True, recurse=True)

When iterating through the sections, the loop processes both the parent and child sections, causing repeated content in the HTML output and resulting in unintended duplication.

Here is the relevant code:

    def to_html(self):
        """
        Returns html for the document by iterating through all the sections
        """
        html_str = "<html>"
        for section in self.sections():
            html_str = html_str + section.to_html(include_children=True, recurse=True)
        html_str = html_str + "</html>"
        return html_str
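One possible fix, sketched below, is to render only top-level sections and let the existing recursion emit each child exactly once. This assumes each section block exposes a `level` attribute with 0 for top-level sections; verify that against your llmsherpa version before relying on it.

```python
def doc_to_html(sections):
    """Render only top-level sections; recursing into children then emits
    each nested section exactly once instead of repeating it.

    Assumes each section has a .level attribute (0 = top level) and a
    .to_html(include_children, recurse) method.
    """
    html_str = "<html>"
    for section in sections:
        if section.level == 0:  # skip children already rendered by their parent
            html_str += section.to_html(include_children=True, recurse=True)
    return html_str + "</html>"
```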

Inquiry about Open-Sourcing the Entire Project

Hello, @ansukla

I've taken a keen interest in your project and am considering its integration into our product. However, we have concerns about potential issues with the provided API and would ideally prefer to deploy it locally to ensure stability and performance.

With this in mind, I'd like to inquire if there are any plans to open-source the entire project. If so, could you provide an estimated timeline for when this might be available?

Your guidance on this would be greatly appreciated.

Thank you for your time and consideration!

Best regards,

Feature Request - API Call Parameters to set chunk minimum and maximum length.

Hello, looks great so far. Would appreciate the ability to include parameters in the API call that specify both a minimum and maximum character length for the resulting chunks.

  • Minimum chunk size would look across the resulting chunk objects and do simple concatenation until they are over some value for total text length.
  • Maximum chunk size would split chunk text into multiple segments while preserving the title/section smart labeling you already do.
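The minimum-size half of this request could be implemented today as a post-processing pass over the chunk texts. The sketch below greedily concatenates consecutive chunks; `merge_small_chunks` is a hypothetical helper, not part of the llmsherpa API.

```python
def merge_small_chunks(texts, min_chars=512):
    """Greedily concatenate consecutive chunk texts until each merged
    chunk reaches min_chars; the final chunk may remain shorter."""
    merged, buffer = [], ""
    for text in texts:
        buffer = f"{buffer}\n{text}" if buffer else text
        if len(buffer) >= min_chars:
            merged.append(buffer)
            buffer = ""
    if buffer:
        merged.append(buffer)  # leftover tail, possibly under min_chars
    return merged
```

Maximum-size splitting is harder to do naively because it must preserve the section labeling, which is why server-side support would be preferable.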

Feature Request - Splitting Bounding Boxes Across Pages

I am not sure if this is being handled. During my testing I found that the box coordinates do not account for chunks that cross a page boundary.

I am writing to propose a feature enhancement for your project, specifically regarding the current handling of bounding boxes (bbox) in the context of PDF generation.

  1. Currently, when parsing a PDF, a single bbox is produced for a chunk of text that spans a page boundary. While this approach is effective, I would like to suggest an enhancement: splitting the bbox across the pages involved, providing more granularity in representing the layout of text across pages.

  2. The second feature is to include each page's width and height (mediabox, cropbox, or rect) in the API response. This would make the bbox much more usable for adding an annotation/highlight layer.

bbox intersection

when trying to load multiple documents with joblib, get error cannot pickle

I am trying to parallelize ingestion of multiple, locally-stored PDFs, in my vectorstore.

When loading multiple documents with joblib, I get a "cannot pickle" error:

PicklingError: Could not pickle the task to send it to the workers.

Is this because the API call accesses an external server for every PDF I load with llmsherpa?
What would be a workaround for this? Making it async (and if so, how)?

I think this is important for production.

thank you
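A common workaround, since the reader holds an unpicklable HTTP connection pool: construct the reader inside each worker instead of shipping it across process boundaries, or use threads, which skip pickling entirely and suit this I/O-bound work. Below is a sketch with stdlib `concurrent.futures`; `parse_one`, `parse_all`, and the factory argument are illustrative (in real use the factory would be `LayoutPDFReader`).

```python
from concurrent.futures import ThreadPoolExecutor

API_URL = "http://localhost:5001/api/parseDocument?renderFormat=all"  # your server

def parse_one(path, reader_factory):
    """Build the (unpicklable) reader inside the worker rather than
    passing a shared instance in from the parent."""
    reader = reader_factory(API_URL)
    return reader.read_pdf(path)

def parse_all(paths, reader_factory, workers=4):
    # Threads avoid pickling entirely; the work is I/O-bound (HTTP calls),
    # so the GIL is not a bottleneck here. Results keep input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: parse_one(p, reader_factory), paths))
```

The same create-inside-the-worker pattern also fixes joblib's process-based backends, since only the path strings then need to be pickled.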

Bug in load_data when using full path

This code would fail:

full_path = 'C:\\temp\\A\\test.pdf'
documents = pdf_loader.load_data(full_path )

However, if a relative path is given it works fine.

It looks like the issue is in file_reader.py:63
is_url = urlparse(path_or_url).scheme != ""

In the case of a full Windows path, the scheme will be the drive letter ('c' in this case), which makes the loader treat the path as a URL instead of a file path.
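A possible patch (a sketch of the idea, not the project's official fix) is to accept only real URL schemes, so single-letter Windows drive schemes fall through to file handling:

```python
from urllib.parse import urlparse

def is_url(path_or_url):
    """Treat the string as a URL only for known schemes. A Windows path
    like 'C:\\temp\\test.pdf' parses with scheme 'c', which is a drive
    letter, not a URL scheme."""
    scheme = urlparse(path_or_url).scheme
    return scheme in ("http", "https", "ftp")
```

An alternative with the same effect is checking `os.path.exists(path_or_url)` first and only falling back to URL handling when no local file matches.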

Missing Urllib3 Dependency

Still experimenting with the library, but it looks great so far. It looks like urllib3 is missing from the required dependencies. I opened a PR #1 to add this into setup.py

Not able to get all the subsection names inside a section

Hi, I am using the attached PDF for testing. There is no whitespace between the subsection title and the subsection content, and the library is not able to extract all the subsection titles present within a section. I tried with a different PDF where whitespace is present, and it worked pretty well. Could you please advise how to extract a specific subsection title along with its corresponding content?
RWXcE3.pdf.pdf

list of pdfs as input

pdf_url = "https://arxiv.org/pdf/1910.13461.pdf"

Does read_pdf support a list of PDFs, or only a single PDF?
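read_pdf takes one path or URL per call; a list can simply be looped over. `read_pdfs` below is a hypothetical convenience wrapper, not part of the API:

```python
def read_pdfs(reader, paths_or_urls):
    """read_pdf accepts a single path or URL; to process several PDFs,
    map it over the list and collect the resulting Document objects."""
    return [reader.read_pdf(p) for p in paths_or_urls]
```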

Failed to establish a new connection: [Errno 111] Connection refused'))

I used the test_llmsherpa_api.ipynb file but got a connection error.

WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x78a6541b8910>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/parseDocument?renderFormat=all&useNewIndentParser=true&applyOcr=yes
WARNING:urllib3.connectionpool:Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x78a6541b8a90>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/parseDocument?renderFormat=all&useNewIndentParser=true&applyOcr=yes
WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x78a6541b8c70>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/parseDocument?renderFormat=all&useNewIndentParser=true&applyOcr=yes
---------------------------------------------------------------------------
ConnectionRefusedError                    Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/urllib3/connection.py](https://localhost:8080/#) in _new_conn(self)
    202         try:
--> 203             sock = connection.create_connection(
    204                 (self._dns_host, self.port),

20 frames
ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

NewConnectionError                        Traceback (most recent call last)
NewConnectionError: <urllib3.connection.HTTPConnection object at 0x78a6541b8e20>: Failed to establish a new connection: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

MaxRetryError                             Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/urllib3/util/retry.py](https://localhost:8080/#) in increment(self, method, url, response, error, _pool, _stacktrace)
    513         if new_retry.is_exhausted():
    514             reason = error or ResponseError(cause)
--> 515             raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
    516 
    517         log.debug("Incremented Retry for (url='%s'): %r", url, new_retry)

MaxRetryError: HTTPConnectionPool(host='localhost', port=5001): Max retries exceeded with url: /api/parseDocument?renderFormat=all&useNewIndentParser=true&applyOcr=yes (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x78a6541b8e20>: Failed to establish a new connection: [Errno 111] Connection refused'))

just want to get the paragraph information

The project's ability to extract paragraph information from PDF files is exciting. If I just want to get the paragraph information inside the PDF without using an LLM, can I do it without calling the API? Or can you recommend some related technologies and repositories?

keyerror: result

Running test script in colab:

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_url = "dagpenge_LH_merged.pdf" # also allowed is a file path e.g. /home/downloads/xyz.pdf
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

returns

KeyError Traceback (most recent call last)
in <cell line: 6>()
4 pdf_url = "dagpenge_LH_merged.pdf" # also allowed is a file path e.g. /home/downloads/xyz.pdf
5 pdf_reader = LayoutPDFReader(llmsherpa_api_url)
----> 6 doc = pdf_reader.read_pdf(pdf_url)

/usr/local/lib/python3.10/dist-packages/llmsherpa/readers/file_reader.py in read_pdf(self, path_or_url)
39 parser_response = self._parse_pdf(pdf_file)
40 response_json = json.loads(parser_response.data.decode("utf-8"))
---> 41 blocks = response_json['return_dict']['result']['blocks']
42 return Document(blocks)
43 # def read_file(file_path):

KeyError: 'result'

I often get this error when trying to run the demo script. It also occurred yesterday, but then running it a few times "solved" the issue; that no longer works.
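On failures the server appears to return a payload without the 'result' key. A defensive wrapper (a sketch; the key names are taken from the traceback above, and `extract_blocks` is a hypothetical helper) would surface the server's actual response instead of a bare KeyError:

```python
import json

def extract_blocks(response_bytes):
    """Return parsed blocks, or raise a readable error when the server
    reports a failure instead of a result."""
    payload = json.loads(response_bytes.decode("utf-8"))
    result = payload.get("return_dict", {}).get("result")
    if result is None or "blocks" not in result:
        # Surface whatever the server sent back instead of KeyError: 'result'
        raise ValueError(f"parser returned no blocks: {payload}")
    return result["blocks"]
```

Seeing the full error payload usually reveals whether the server rejected the file, timed out, or hit an OCR/parsing problem.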

Parsing bytes-like objects directly

Hi,

I've been experimenting with llmsherpa with a small streamlit app to which I upload PDFs, and I saw that I can't use the uploaded files (which are bytes-like) directly.

I have code to handle that directly, though I wonder if this feature is needed by anyone else... If there's interest I can open a PR.
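Until bytes-like input is supported directly, a temporary-file shim works with the existing path-based API; `read_pdf_bytes` is a hypothetical helper, and this assumes read_pdf accepts a local file path (which it does per the examples above):

```python
import os
import tempfile

def read_pdf_bytes(reader, data, suffix=".pdf"):
    """Write uploaded bytes (e.g. from a Streamlit file uploader) to a
    temporary file, hand the path to read_pdf, and clean up afterwards."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        return reader.read_pdf(path)
    finally:
        os.remove(path)
```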
