This repository contains code for extracting details such as the author, the author's institution, the companies covered, each company's target price, and the BUY/SELL call from financial PDF documents.
Use apt to install the system dependencies and the package manager pip to install the Python dependencies.
apt install tesseract-ocr
apt-get install poppler-utils
pip install -r requirements.txt
python -m spacy download en_core_web_trf
python -m spacy download en_core_web_sm
On Windows, also install Tesseract, then set the path to the executable in your script (on Linux, use the path reported by 'which tesseract'):
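import pytesseract
# Set exactly ONE path: adjacent string literals inside parentheses
# are concatenated by Python into a single (invalid) path.
# Windows: path to the .exe file
pytesseract.pytesseract.tesseract_cmd = r"C:\Users\user\Programs\Tesseract-OCR\tesseract.exe"
# Linux: run 'which tesseract' after installing tesseract to get the path
# pytesseract.pytesseract.tesseract_cmd = r"/usr/bin/tesseract"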
NOTE: pytesseract is only necessary for methods using Tesseract-OCR.
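Hard-coding one path per platform can be avoided by looking the executable up on PATH first. The following is a minimal sketch using only the standard library; `resolve_tesseract_cmd` and `FALLBACK_PATHS` are hypothetical names and not part of this repository.

```python
import shutil
from pathlib import Path

# Fallback locations copied from the example above; adjust them to your system.
FALLBACK_PATHS = [
    r"C:\Users\user\Programs\Tesseract-OCR\tesseract.exe",
    "/usr/bin/tesseract",
]


def resolve_tesseract_cmd(fallbacks=FALLBACK_PATHS):
    """Return the path to the tesseract executable, or None if it is not found."""
    # shutil.which searches the directories listed in the PATH variable.
    found = shutil.which("tesseract")
    if found:
        return found
    # Otherwise try the known install locations in order.
    for candidate in fallbacks:
        if Path(candidate).is_file():
            return candidate
    return None


cmd = resolve_tesseract_cmd()
print(cmd or "tesseract not found; install it or extend FALLBACK_PATHS")
```

When the result is not None, it can be assigned to pytesseract.pytesseract.tesseract_cmd in place of a hard-coded path.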
from ImgProcess import ImageTools
# split the document image into regions of interest
# so that irrelevant parts of the document are skipped
pdf_image = 'path to image of document'
img_tool = ImageTools()
doc_imgs = img_tool(pdf_image)
from reader import PDFReader
# returns the text content in a PDF file using ImageTools
# 3 available methods
# - pdfplumber
# - pytesseract
# - pytesseract_split (passed as 'tesseract_split' in the examples in this README)
reader = PDFReader(pdf_method='tesseract_split')
text_content = reader('path_to_pdf_document')
from NER import EntityRecognition
# extracts the details from the text content
ER = EntityRecognition(pdf_method='tesseract_split')
author_institution, author, companies, target = ER(text_content)
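The four returned values can be easier to pass around as one named record. Below is a small sketch, assuming the tuple order shown above; `ReportDetails` is a hypothetical container, the field types are guesses, and the sample values are placeholders rather than real output.

```python
from typing import List, NamedTuple, Optional


class ReportDetails(NamedTuple):
    """Hypothetical container; field order mirrors the tuple returned by ER."""
    author_institution: Optional[str]
    author: Optional[str]
    companies: List[str]
    target: Optional[str]


# Placeholder values standing in for the output of ER(text_content).
raw = ("Example Bank", "Jane Doe", ["Example Corp"], "120.00")
details = ReportDetails(*raw)

print(details.author)     # access by name instead of by position
print(details._asdict())  # plain dict, convenient for JSON or CSV export
```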
# Extract details from a single pdf file
python main.py --pdf_method='tesseract_split' --pdf_file='path_to_pdf'
# Extract details from a directory of pdf files
python main.py --pdf_method='tesseract_split' --pdf_dir='path_to_pdf_dir'
# Extract details from a directory of pdf files to CSV file
python results.py --pdf_method='tesseract_split' --pdf_dir='path_to_pdf_dir' --csv_path='path_to_csv'
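For readers who want to adapt the batch export, the sketch below shows one way a directory-to-CSV run with these flag names could be structured. It is not the actual results.py: the column names, the `extract_details` stub, and the helper names are assumptions made for illustration, and only the standard library is used.

```python
import argparse
import csv
from pathlib import Path

# Assumed column layout: file name plus the four extracted values.
FIELDS = ["file", "author_institution", "author", "companies", "target"]


def build_parser():
    """Build a parser with the same flag names as the commands above."""
    parser = argparse.ArgumentParser(description="Batch-extract details to CSV")
    parser.add_argument("--pdf_method", default="tesseract_split")
    parser.add_argument("--pdf_dir", required=True)
    parser.add_argument("--csv_path", required=True)
    return parser


def extract_details(pdf_path, pdf_method):
    """Stub: replace with the PDFReader and EntityRecognition calls."""
    return ("", "", [], "")


def write_results(pdf_dir, csv_path, pdf_method):
    """Write one CSV row per PDF in pdf_dir and return the number of rows."""
    pdf_files = sorted(Path(pdf_dir).glob("*.pdf"))
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(FIELDS)
        for pdf_path in pdf_files:
            institution, author, companies, target = extract_details(
                pdf_path, pdf_method
            )
            # Join the company list so it fits into a single CSV cell.
            writer.writerow(
                [pdf_path.name, institution, author, "; ".join(companies), target]
            )
    return len(pdf_files)


def main(argv=None):
    """Parse argv (or sys.argv when argv is None) and run the export."""
    args = build_parser().parse_args(argv)
    count = write_results(args.pdf_dir, args.csv_path, args.pdf_method)
    print(f"Wrote {count} rows to {args.csv_path}")
    return count
```

Calling `main()` with no argument reads the flags from the command line; passing a list such as `main(["--pdf_dir", "reports", "--csv_path", "out.csv"])` makes the function easy to exercise from a test.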
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.