vitali84 / pdf-to-csv-table-extactor


124.0 124.0 48.0 180 KB

Extract tables from scanned PDF documents into a CSV file using OCR and image processing

License: Do What The F*ck You Want To Public License

Python 100.00%

pdf-to-csv-table-extactor's People

Contributors

Stargazers

Watchers

Forkers

alwc yamommad yeayee hitman56 vivek100041 willkill420 longzhi-aa oshelly idhruvc marvin106722 boatmonk yousoferfani dragie neowalter gavenswang hkurra joshzyj eastonsuo penroselearning licshire vaibhavbaswal95 swmah88 rajesh67 ipadawan captainstabs zebiao1998 metouitude zwphit ladechan chiefkana kp-forks f-emm bellyfat lanxuanete ltttdh masteryeda kalntera tiwanacote nickvanraaijt yerbymatey hannah-murray joanpaon biakota sshuster kiristern

pdf-to-csv-table-extactor's Issues

Getting error while processing the pdf as per the instruction given

Below is the log trace

%%python pdf-to-csv-cv.py -p test.pdf

Traceback (most recent call last):
  File "pdf-to-csv-cv.py", line 213, in <module>
    process_file(file_name)
  File "pdf-to-csv-cv.py", line 37, in process_file
    extracted_table = extract_main_table(gray_image)
  File "pdf-to-csv-cv.py", line 70, in extract_main_table
    cnts = sorted(cnts, key=cv2.contourArea, reverse=True)
cv2.error: OpenCV(4.1.1) C:\projects\opencv-python\opencv\modules\imgproc\src\shapedescr.cpp:274: error: (-215:Assertion failed) npoints >= 0 && (depth == CV_32F || depth == CV_32S) in function 'cv::contourArea'

Traceback (most recent call last):

  File "", line 1, in <module>
    get_ipython().run_cell_magic('python', 'pdf-to-csv-cv.py -p test.pdf', '\n')

  File "C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2358, in run_cell_magic
    result = fn(*args, **kwargs)

  File "C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\magics\script.py", line 142, in named_script_magic
    return self.shebang(line, cell)

  File "<C:\ProgramData\Anaconda3\lib\site-packages\decorator.py:decorator-gen-111>", line 2, in shebang

  File "C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\magic.py", line 187, in <lambda>
    call = lambda f, *a, **k: f(*a, **k)

  File "C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\magics\script.py", line 245, in shebang
    raise CalledProcessError(p.returncode, cell, output=out, stderr=err)

CalledProcessError: Command 'b'\n'' returned non-zero exit status 1.

Please help me resolve this issue.
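
A possible cause, offered as an assumption rather than a confirmed diagnosis of this script: the return value of cv2.findContours changed between OpenCV releases. OpenCV 3.x returns (image, contours, hierarchy), while OpenCV 4.x returns (contours, hierarchy). Code that picks the contour list by an index written for 3.x can end up passing the hierarchy array to cv2.contourArea on 4.x, which typically fails with an assertion like the one above. The sketch below uses a hypothetical helper, grab_contours, and stand-in tuples instead of real OpenCV output so it runs without OpenCV installed.

```python
def grab_contours(find_contours_result):
    """Return the contour list from a cv2.findContours return value.

    OpenCV 3.x returns (image, contours, hierarchy);
    OpenCV 4.x returns (contours, hierarchy).
    """
    if len(find_contours_result) == 2:
        return find_contours_result[0]
    if len(find_contours_result) == 3:
        return find_contours_result[1]
    raise ValueError("unexpected cv2.findContours return value")


# Stand-in tuples that only mimic the *shape* of the two return styles.
fake_contours = ["contour_a", "contour_b"]
opencv4_style = (fake_contours, "hierarchy")
opencv3_style = ("image", fake_contours, "hierarchy")

print(grab_contours(opencv4_style))
print(grab_contours(opencv3_style))
```

In the script, the idea would be to wrap the existing cv2.findContours(...) call with such a helper before sorting by cv2.contourArea. Whether line 70 of pdf-to-csv-cv.py is affected in exactly this way is not confirmed by the trace.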

Namespace(pdf=['test.pdf'])

I get Namespace(pdf=['test.pdf']) as output when running the script.
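
For context, that line is what Python prints when the object returned by argparse's parse_args() is passed to print(), so it shows the parsed arguments rather than an error. A minimal sketch, assuming the option is declared roughly as in the other issues on this page (the -p short flag is taken from the command line quoted in the first issue and is an assumption about the script):

```python
import argparse

ap = argparse.ArgumentParser()
# Assumed declaration; nargs="*" collects zero or more paths into a list.
ap.add_argument("-p", "--pdf", type=str, nargs="*", required=False, default=[])

# Equivalent to running: python script.py -p test.pdf
args = ap.parse_args(["-p", "test.pdf"])
print(args)  # Namespace(pdf=['test.pdf'])
```

If nothing else appears after that line, the processing step is either not producing output for that file or is failing quietly. The Namespace line alone does not say which.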

Table data not getting extracted for some PDFs

Hi Vitali,
I'm facing an issue with some PDFs that are similar in format.
For some of them the data is extracted, and for others it is not.
Could you please help me with this?

File working fine:
imgpsh_fullsize_anim-converted.pdf

Files not working (each has a different issue: one gives an error, the other gives an inconsistent response):
ALECO_EXIM-LISTA_ANGAJATI.pdf
Lista_persoane_somaj_tehnic_MI_ANDRE_BELLA.pdf

Landscape

Hi @vitali84,
Thank you for sharing your work. How can I make this repo work on landscape PDFs? Can you please point me to the line where a rotation of the PDF could be done?
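
A rough illustration of the rotation step, not taken from the repo: once a page has been turned into a pixel grid, a landscape page can be rotated by 90 degrees before table detection. The sketch below uses plain Python lists so it runs anywhere; rotate_clockwise and to_portrait are hypothetical names, and the clockwise direction is an assumption that depends on how the page was scanned. With OpenCV images the equivalent call is cv2.rotate(img, cv2.ROTATE_90_CLOCKWISE).

```python
def rotate_clockwise(grid):
    """Rotate a 2D grid (list of rows) by 90 degrees clockwise."""
    return [list(column) for column in zip(*grid[::-1])]


def to_portrait(grid):
    """Rotate the grid only if it is wider than it is tall (landscape)."""
    height = len(grid)
    width = len(grid[0]) if grid else 0
    return rotate_clockwise(grid) if width > height else grid


# A tiny 2x3 "landscape page"; the numbers stand in for pixel values.
landscape = [
    [1, 2, 3],
    [4, 5, 6],
]
print(to_portrait(landscape))  # [[4, 1], [5, 2], [6, 3]]
```

Where exactly this would go in pdf-to-csv-cv.py is not something the issue text settles. A natural place would be right after the page image is loaded and before the table is extracted.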

ap = argparse.ArgumentParser() ap.add_argument("--pdf",type=str, nargs='*', required=False, default=)

ap = argparse.ArgumentParser()
ap.add_argument("--pdf",type=str, nargs='*', required=False,
                default="C:/Users/李岳锋/Desktop/10.10/300007汉威科技：2019年年度报告.pdf")

Could you explain more clearly how this file path should be written? The script keeps failing when I run it.
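
One thing worth checking, as a guess rather than a confirmed diagnosis: with nargs='*', values given on the command line arrive as a list, but a default written as a plain string stays a string. If the script loops over args.pdf, a string default would be walked character by character instead of being treated as one path. The sketch below shows the difference; the path used is a made-up placeholder and parse_with_default is a hypothetical helper.

```python
import argparse


def parse_with_default(default_value):
    """Parse an empty command line so that the default is used."""
    ap = argparse.ArgumentParser()
    ap.add_argument("--pdf", type=str, nargs="*", required=False,
                    default=default_value)
    return ap.parse_args([])


# Placeholder path for illustration only.
path = "C:/Users/me/Desktop/report.pdf"

as_string = parse_with_default(path)   # default given as a plain string
as_list = parse_with_default([path])   # default given as a one-element list

print(type(as_string.pdf).__name__, len(as_string.pdf))
print(type(as_list.pdf).__name__, len(as_list.pdf))
```

Wrapping the default in square brackets, or passing the path on the command line with --pdf, keeps args.pdf a list in both cases. Forward slashes in the path, as already used in the snippet above, are generally accepted by Python on Windows.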

minecart and cv2: even after installing them on Windows, I get a "module not found" error. Can you please help me get started?
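
A frequent reason for this kind of error is that the packages were installed into a different Python installation than the one running the script. The small check below is a sketch (missing_modules is a hypothetical helper name) that prints which interpreter is in use and which of the two modules it cannot find.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names that this interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The interpreter that is actually running this code.
print("Interpreter:", sys.executable)
print("Missing:", missing_modules(["cv2", "minecart"]))
```

If either module is listed as missing, installing with the same interpreter usually helps, for example by running python -m pip install opencv-python minecart (opencv-python is the PyPI package that provides cv2).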

install minecart

Hi
I get an error while installing minecart.

pdfminer.pdftypes.PDFNotImplementedError: Colorspace 'ICCBased' is not supported

Hi,

I am using this module to read a PDF file, but I hit this error when calling im = page.images[0].as_pil()
After checking, I saw that minecart.content.Image does not support the ICCBased colorspace.
Is there any way to convert ICCBased to RGB before calling as_pil, or any other workaround?

Thanks and Regards,
Binh Nguyen
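
Some background that may help with a workaround: in the PDF format an ICCBased colour space declares its number of colour components (the N entry, which is 1, 3 or 4), and a common fallback is to treat the image as gray, RGB or CMYK accordingly and ignore the ICC profile. The sketch below only shows that mapping to Pillow mode names. The function name is hypothetical, it is not part of minecart or pdfminer, and how to hook it into minecart is not covered here.

```python
# Number of colour components (N) of an ICCBased colour space -> Pillow mode.
ICC_COMPONENTS_TO_PIL_MODE = {1: "L", 3: "RGB", 4: "CMYK"}


def pil_mode_for_icc(n_components):
    """Pick a fallback Pillow mode for an ICCBased image, ignoring the profile."""
    try:
        return ICC_COMPONENTS_TO_PIL_MODE[n_components]
    except KeyError:
        raise ValueError(
            "ICCBased colour spaces have 1, 3 or 4 components, got %r"
            % (n_components,)
        ) from None


for n in (1, 3, 4):
    print(n, "->", pil_mode_for_icc(n))
```

Once an image has been loaded in one of these modes, Pillow's Image.convert("RGB") turns it into RGB. Colours may shift slightly because the embedded profile is not applied.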
