gpt-data-extraction's People
Forkers
healthmemmo mecamangeg arthuraraujo krasimirivanov1991 bradleisure az1z1ally karabogerald jorcan kszcode gapilongo david-hoa2023 ia-programming jaredkirby chiwenheng carbongpt laith619 bradsterling macdonkt prathapkr martincooperbiz mexicanamerican mbell195 bartkamphuis thefreshprince01 tonywhite11 foogprety nithin412 nickoyuan mbastar asanchez75 edujugon dpss77 munirabobaker aholtheuerl cyb3rsh3ll mikestromme masondg21 humaira699 pranjalrana11 giousedaponte synnosa yacineali74 wangn25
gpt-data-extraction's Issues
OpenAI rate limit
Every time I try a file larger than about 2 MB, it shows this error message:
RateLimitError: Rate limit reached for default-gpt-3.5-turbo in organization org-80J2h6neuD6bQemh48GUwxlZ on tokens per min. Limit: 40000 / min. Please try again in 1ms. Contact us through our help center at help.openai.com if you continue to have issues. Please add a payment method to your account to increase your rate limit. Visit https://platform.openai.com/account/billing to add a payment method.
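One possible way around the tokens-per-minute limit, sketched here under assumptions rather than taken from the repo: split the extracted text into smaller chunks and retry each request with exponential backoff when the rate-limit error is raised. The helper names `split_into_chunks` and `call_with_backoff`, the chunk sizes, and the delays are all hypothetical; the exception class to pass as `retry_on` depends on the installed `openai` version.

```python
import random
import time


def split_into_chunks(text, chunk_size=4000, overlap=200):
    """Split text into overlapping character chunks (sizes are illustrative)."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    start = 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks


def call_with_backoff(fn, retry_on=(Exception,), max_retries=5,
                      base_delay=1.0, sleep=time.sleep):
    """Call fn(); on a retryable error wait roughly base_delay * 2**attempt, then retry."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                # Out of retries: surface the original error to the caller.
                raise
            # Exponential backoff with a little random jitter.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```

In the app this might be used as `call_with_backoff(lambda: extract_structured_data(chunk, data_points), retry_on=(RateLimitError,))` for each chunk, collecting the per-chunk results. Adding a payment method, as the error message suggests, raises the limit itself; backoff only spaces the requests out.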
PdfiumError: Failed to load document (PDFium: File access error) while uploading the pdf file
I tried running the same Python code that you uploaded in this repo. The code is below:
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from dotenv import load_dotenv
from pytesseract import image_to_string
from langchain.text_splitter import RecursiveCharacterTextSplitter
from PIL import Image
from io import BytesIO
# import pypdfium2 as pdfium
import pypdfium2 as pdfium
import streamlit as st
import multiprocessing
from tempfile import NamedTemporaryFile
import pandas as pd
import pytesseract
import json
import os
import requests
load_dotenv()
os.environ["OPENAI_API_KEY"] = ""
# 1. Convert PDF file into images via pypdfium2
def convert_pdf_to_images(file_path, scale=300 / 72):
    pdf_file = pdfium.PdfDocument(file_path)
    page_indices = [i for i in range(len(pdf_file))]
    renderer = pdf_file.render(
        pdfium.PdfBitmap.to_pil,
        page_indices=page_indices,
        scale=scale,
    )
    final_images = []
    for i, image in zip(page_indices, renderer):
        image_byte_array = BytesIO()
        image.save(image_byte_array, format='jpeg', optimize=True)
        image_byte_array = image_byte_array.getvalue()
        final_images.append(dict({i: image_byte_array}))
    return final_images
# 2. Extract text from images via pytesseract
def extract_text_from_img(list_dict_final_images):
    image_list = [list(data.values())[0] for data in list_dict_final_images]
    image_content = []
    for index, image_bytes in enumerate(image_list):
        image = Image.open(BytesIO(image_bytes))
        raw_text = str(image_to_string(image))
        image_content.append(raw_text)
    return "\n".join(image_content)

def extract_content_from_url(url: str):
    images_list = convert_pdf_to_images(url)
    text_with_pytesseract = extract_text_from_img(images_list)
    return text_with_pytesseract
# 3. Extract structured info from text via LLM
def extract_structured_data(content: str, data_points):
    llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613")
    template = """
    You are an expert admin people who will extract core information from documents
    {content}
    Above is the content; please try to extract all data points from the content above
    and export in a JSON array format:
    {data_points}
    Now please extract details from the content and export in a JSON array format,
    return ONLY the JSON array:
    """
    prompt = PromptTemplate(
        input_variables=["content", "data_points"],
        template=template,
    )
    chain = LLMChain(llm=llm, prompt=prompt)
    results = chain.run(content=content, data_points=data_points)
    # text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    # chunks = text_splitter.split_documents(content)
    # results = [chain.run(content=chunk, data_points=data_points) for chunk in chunks]
    return results
# 5. Streamlit app
def main():
    default_data_points = """{
        "invoice_item": "what is the item that charged",
        "Amount": "how much does the invoice item cost in total",
        "Company_name": "company that issued the invoice",
        "invoice_date": "when was the invoice issued",
    }"""
    st.set_page_config(page_title="Doc extraction", page_icon=":bird:")
    st.header("Doc extraction :bird:")
    data_points = st.text_area(
        "Data points", value=default_data_points, height=170)
    uploaded_files = st.file_uploader(
        "upload PDFs", accept_multiple_files=True)
    if uploaded_files is not None and data_points is not None:
        results = []
        for file in uploaded_files:
            with NamedTemporaryFile(dir='.', suffix='.csv') as f:
                f.write(file.getbuffer())
                content = extract_content_from_url(f.name)
                print(content)
                data = extract_structured_data(content, data_points)
                json_data = json.loads(data)
                if isinstance(json_data, list):
                    results.extend(json_data)  # Use extend() for lists
                else:
                    results.append(json_data)  # Wrap the dict in a list
        if len(results) > 0:
            try:
                df = pd.DataFrame(results)
                st.subheader("Results")
                st.data_editor(df)
            except Exception as e:
                st.error(
                    f"An error occurred while creating the DataFrame: {e}")
                st.write(results)  # Print the data to see its content

if __name__ == '__main__':
    multiprocessing.freeze_support()
    main()
But when I upload the PDF file in the Streamlit app, it returns the error below:
PdfiumError: Failed to load document (PDFium: File access error).
Can you please let me know how to fix this error?
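A possible cause, offered as a guess rather than a confirmed diagnosis: the code passes `f.name` to PDFium while the `NamedTemporaryFile` is still open and unflushed. On Windows a temporary file opened this way generally cannot be opened a second time by name until it is closed, and on any platform the written bytes may not have reached disk yet. A minimal sketch of one workaround is to write the upload to a file created with `delete=False`, close it, process it, and remove it afterwards. The helper names `save_upload_to_temp_pdf` and `process_upload` are hypothetical and not part of the repo.

```python
import os
from tempfile import NamedTemporaryFile


def save_upload_to_temp_pdf(data, directory=None):
    """Write uploaded bytes to a temporary .pdf file, close it, and return its path."""
    with NamedTemporaryFile(dir=directory, suffix=".pdf", delete=False) as tmp:
        tmp.write(data)
    # Leaving the with-block closes (and flushes) the file, so another
    # library can now open it by name.
    return tmp.name


def process_upload(data, handler):
    """Save data to a temp file, run handler(path), and always delete the file."""
    path = save_upload_to_temp_pdf(data)
    try:
        return handler(path)
    finally:
        os.remove(path)
```

Inside the upload loop this could replace the `with NamedTemporaryFile(...)` block with something like `content = process_upload(file.getbuffer(), extract_content_from_url)`.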