vitali84 / pdf-to-csv-table-extactor
Extract tables from scanned PDF documents into a CSV file using OCR and image processing
License: Do What The F*ck You Want To Public License
%%python pdf-to-csv-cv.py -p test.pdf
Traceback (most recent call last):
File "pdf-to-csv-cv.py", line 213, in
process_file(file_name)
File "pdf-to-csv-cv.py", line 37, in process_file
extracted_table = extract_main_table(gray_image)
File "pdf-to-csv-cv.py", line 70, in extract_main_table
cnts = sorted(cnts, key=cv2.contourArea, reverse=True)
cv2.error: OpenCV(4.1.1) C:\projects\opencv-python\opencv\modules\imgproc\src\shapedescr.cpp:274: error: (-215:Assertion failed) npoints >= 0 && (depth == CV_32F || depth == CV_32S) in function 'cv::contourArea'
Traceback (most recent call last):
File "", line 1, in
get_ipython().run_cell_magic('python', 'pdf-to-csv-cv.py -p test.pdf', '\n')
File "C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2358, in run_cell_magic
result = fn(*args, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\magics\script.py", line 142, in named_script_magic
return self.shebang(line, cell)
File "<C:\ProgramData\Anaconda3\lib\site-packages\decorator.py:decorator-gen-111>", line 2, in shebang
File "C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\magic.py", line 187, in
call = lambda f, *a, **k: f(*a, **k)
File "C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\magics\script.py", line 245, in shebang
raise CalledProcessError(p.returncode, cell, output=out, stderr=err)
Please help to solve the issue
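One common cause of this assertion in cv2.contourArea, offered here as a guess rather than a confirmed diagnosis of this script, is that cv2.findContours returns a different tuple shape depending on the OpenCV version: (image, contours, hierarchy) in 3.x, but (contours, hierarchy) in 2.x and 4.x. Code written against 3.x that takes element [1] ends up sorting the hierarchy array instead of the contours when run on 4.1.1. A minimal sketch of a version-tolerant helper follows; the names grab_contours and contours_by_area are hypothetical and not part of the repository.

```python
def grab_contours(result):
    """Return the contour list from a cv2.findContours() result tuple.

    OpenCV 2.x and 4.x return (contours, hierarchy);
    OpenCV 3.x returns (image, contours, hierarchy).
    """
    if len(result) == 2:
        return result[0]
    if len(result) == 3:
        return result[1]
    raise ValueError("unexpected findContours() result of length %d" % len(result))


def contours_by_area(binary_image):
    """Find external contours and sort them from largest to smallest area."""
    import cv2  # imported here so grab_contours() itself has no dependencies

    result = cv2.findContours(binary_image, cv2.RETR_EXTERNAL,
                              cv2.CHAIN_APPROX_SIMPLE)
    cnts = grab_contours(result)
    return sorted(cnts, key=cv2.contourArea, reverse=True)
```

If this is the cause, routing the findContours result through such a helper before the sorted(...) call at line 70 should make the script independent of the installed OpenCV version.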
Getting Namespace(pdf=['test.pdf']) as output while running the script
Hi Vitali,
I'm facing an issue with some PDFs that are similar in format.
For some of them the data is extracted, and for others it is not.
Could you please help me with this?
File working fine:
imgpsh_fullsize_anim-converted.pdf
Files not working (they have different issues: one gives an error and the other gives an inconsistent response):
ALECO_EXIM-LISTA_ANGAJATI.pdf
Lista_persoane_somaj_tehnic_MI_ANDRE_BELLA.pdf
Hi @vitali84 ,
Thank you for sharing your work. How can I make this repo work on a landscape PDF? Can you please point to the line where the PDF could be rotated?
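As a rough illustration of one way to handle landscape pages, the sketch below rotates a page image to portrait before any table detection runs. It assumes the page has already been converted to a Pillow image (any object with a .size attribute and a .rotate(angle, expand=True) method); where that happens in the script is not shown here, and the helper name to_portrait is hypothetical. Pillow's rotate() turns the image counter-clockwise, so the angle may need to be 270 instead of 90 depending on how the scan is oriented.

```python
def to_portrait(image, angle=90):
    """Rotate a landscape page image so that it becomes portrait.

    `image` is expected to behave like a PIL.Image.Image.
    Portrait (or square) images are returned unchanged.
    """
    width, height = image.size
    if width > height:
        # expand=True enlarges the canvas so nothing is cropped after rotation
        return image.rotate(angle, expand=True)
    return image
```

A call such as page_image = to_portrait(page_image) would then go just before the image is converted to grayscale and passed on for table extraction.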
ap = argparse.ArgumentParser()
ap.add_argument("--pdf",type=str, nargs='*', required=False,
default="C:/Users/李岳锋/Desktop/10.10/300007汉威科技:2019年年度报告.pdf")
Could you explain more clearly how the file path should be written? The script keeps failing when I run it.
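One thing worth checking in the snippet above: with nargs='*', argparse collects the values into a list, but a plain string default is returned as a string when the option is omitted. If the script then loops over the value, which is an assumption about code not shown here, it would iterate over single characters instead of file names. A minimal sketch with a list default follows; "example.pdf" is a hypothetical placeholder and parse_args is an illustrative wrapper.

```python
import argparse


def parse_args(argv=None):
    ap = argparse.ArgumentParser()
    # nargs="*" yields a list of paths, so the default is a list as well.
    ap.add_argument("-p", "--pdf", type=str, nargs="*",
                    default=["example.pdf"],  # hypothetical placeholder path
                    help="one or more PDF files to process")
    return ap.parse_args(argv)


# Each element is one complete path, including any spaces it contains.
args = parse_args(["-p", "annual report.pdf", "other.pdf"])
for file_name in args.pdf:
    print(file_name)
```

On the command line, a path that contains spaces needs to be quoted, for example python pdf-to-csv-cv.py -p "C:/some folder/report.pdf".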
Even after installing minecart and cv2 on Windows, I am getting a "module not found" error. Can you please help me get started?
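A frequent reason for this kind of error is that the packages were installed into a different Python interpreter than the one running the script. The short check below, a diagnostic sketch and not part of the repository, prints which interpreter is in use and whether it can see each module; module_available is a hypothetical helper name.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the current interpreter can locate the top-level module."""
    return importlib.util.find_spec(name) is not None


print("Interpreter:", sys.executable)
for name in ("cv2", "minecart"):
    print(name, "found" if module_available(name) else "NOT found")
```

If a module is reported as not found, installing with the same interpreter, for example python -m pip install opencv-python minecart, usually resolves it. Note that the cv2 module is provided by the opencv-python package.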
Hi,
I am using this module to read a PDF file, but I ran into a problem when calling im = page.images[0].as_pil()
After checking, I saw that minecart.content.Image does not support the ICCBased colorspace.
Is there any way to convert ICCBased to RGB before calling as_pil, or any other workaround?
Thanks and Regards,
Binh Nguyen