python-fan / pdf2word
Multithreaded PDF-to-Word conversion in 60 lines of code
License: MIT License
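I converted a PDF with a three-column layout. The result needed manual paragraph splitting, but most of the words came out correct. Thanks a lot Y(^_^)Y
Python version 3.8.5, pip version 20.1.1.
Installing the dependencies fails with: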
ERROR: Could not find a version that satisfies the requirement pdfminer3k==1.3.1 (from -r requirements.txt (line 3)) (from versions: 1.3.2, 1.3.3, 1.3.4)
ERROR: No matching distribution found for pdfminer3k==1.3.1 (from -r requirements.txt (line 3))
After changing pdfminer3k==1.3.1 to pdfminer3k==1.3.4 in requirements.txt, the installation succeeds and the program runs.
正在处理: Opportunities for using Navy marine mammals to explore associations between organochlorine contaminants and unfavorable effects on reproduction.pdf
WARNING:root:Cannot locate objid=67
WARNING:root:Wrong type: 0 required: <class 'dict'>
WARNING:root:Catalog not found!
Running python main.py fails with this error: ImportError: No module named 'pdfminer'
How should I fix it?
Could not find a version that satisfies the requirement pdfminer3k==1.3.1 (from -r requirements.txt (line 3)) (from versions: 1.3.2, 1.3.3, 1.3.4)
No matching distribution found for pdfminer3k==1.3.1 (from -r requirements.txt (line 3))
After running it, the result is still just a pile of images. Scanned PDFs cannot be recognized.
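A scanned PDF stores each page as an image, so a text extractor has little or nothing to return and the file would need OCR instead. The sketch below is one rough way to flag such files before conversion. The function name `looks_scanned`, the threshold of 20 characters per page, and the idea of passing in a list of per-page strings are all assumptions for illustration and are not part of pdf2word.

```python
def looks_scanned(page_texts, min_chars_per_page=20):
    """Guess whether a PDF is a scan, based on how little text was extracted.

    page_texts: list of strings, one per page, as returned by whatever
    text extractor is in use (hypothetical input, not a pdf2word API).
    """
    if not page_texts:
        return True
    # Count only non-whitespace characters so blank pages score zero.
    total = sum(len("".join(text.split())) for text in page_texts)
    return total / len(page_texts) < min_chars_per_page


if __name__ == "__main__":
    print(looks_scanned(["", " \n", "\x0c"]))
    print(looks_scanned(["This page has a normal amount of extractable text on it."]))
```

The threshold is a heuristic; a PDF with a few near-empty pages and real text elsewhere may need a different cutoff.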
The PDF contains Chinese text and the conversion fails:
PDFDocEncoding = ''.join( chr(x) for x in (
ValueError: chr() arg not in range(256)
What is going on with this error?
ImportError: cannot import name 'process_pdf' from 'pdfminer.pdfinterp' (/anaconda3/lib/python3.7/site-packages/pdfminer/pdfinterp.py)
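One possible cause, offered as an assumption rather than a diagnosis: the `pdfminer` package that is installed is a different distribution from pdfminer3k (for example pdfminer.six), whose `pdfminer.pdfinterp` module does not define `process_pdf`. The sketch below only inspects the environment and reports which style of API is importable; `detect_pdfminer_flavor` and its return strings are hypothetical names.

```python
import importlib


def detect_pdfminer_flavor():
    """Report which pdfminer-style API is importable in this environment."""
    try:
        pdfinterp = importlib.import_module("pdfminer.pdfinterp")
    except ImportError:
        # No usable pdfminer package at all.
        return "missing"
    if hasattr(pdfinterp, "process_pdf"):
        # The function named in the ImportError above is available.
        return "process_pdf available"
    try:
        importlib.import_module("pdfminer.high_level")
    except ImportError:
        return "unknown pdfminer variant"
    # pdfminer.six ships pdfminer.high_level (with extract_text).
    return "high_level API available"


if __name__ == "__main__":
    print(detect_pdfminer_flavor())
```

If the result is not "process_pdf available", reinstalling the dependencies from requirements.txt in a clean virtual environment is one thing to try.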
λ python main.py
正在处理: 4月报销.pdf
WARNING:root:UniGB-UCS2-H
WARNING:pdfminer.converter:undefined: <PDFCIDFont: basefont='AdobeKaitiStd-Regular', cidcoding='Adobe-GB1'>, 1050
WARNING:pdfminer.converter:undefined: <PDFCIDFont: basefont='AdobeKaitiStd-Regular', cidcoding='Adobe-GB1'>, 2264
WARNING:pdfminer.converter:undefined: <PDFCIDFont: basefont='AdobeKaitiStd-Regular', cidcoding='Adobe-GB1'>, 4409
WARNING:pdfminer.converter:undefined: <PDFCIDFont: basefont='AdobeKaitiStd-Regular', cidcoding='Adobe-GB1'>, 4532
WARNING:pdfminer.converter:undefined: <PDFCIDFont: basefont='AdobeKaitiStd-Regular', cidcoding='Adobe-GB1'>, 3493
WARNING:pdfminer.converter:undefined: <PDFCIDFont: basefont='AdobeKaitiStd-Regular', cidcoding='Adobe-GB1'>, 1480
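In the main function:
1. To avoid a flood of "font not found" warnings in the output:
import logging
logging.Logger.propagate = False
# Raise the root logger's threshold so WARNING messages are dropped.
logging.getLogger().setLevel(logging.ERROR)
2. After converting the PDF to text, apply at least some regex cleanup. Here is what I added:
import re

def pdf_to_word(pdf_file_path, word_file_path):
    content = read_from_pdf(pdf_file_path)
    # Replace a line break between two word characters with a space.
    content = re.sub(r'([0-9a-zA-Z_])\n([0-9a-zA-Z_])', r'\1 \2', content)
    # Rejoin words that were hyphenated across a line break.
    content = re.sub(r'(-)\n([0-9a-zA-Z_])', r'\2', content)
    # Drop line breaks surrounded by single spaces.
    content = re.sub(r' \n ', r'', content)
    # Remove line breaks that do not follow a period.
    content = re.sub(r'([^.])\n', r'\1', content)
    # Strip leftover "(cid:NN)" glyph placeholders.
    content = re.sub(r'\(cid:\d{1,2}\)', r'', content)
    save_text_to_word(content, word_file_path)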
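To see what these substitutions do without running the whole converter, here is a small self-contained sketch that applies the same five patterns to a made-up string. The function name `clean_extracted_text` and the sample text are hypothetical; only the regular expressions come from the post above.

```python
import re

# The same substitutions as above, applied in the same order.
CLEANUP_STEPS = [
    (r'([0-9a-zA-Z_])\n([0-9a-zA-Z_])', r'\1 \2'),  # word\nword -> "word word"
    (r'(-)\n([0-9a-zA-Z_])', r'\2'),                # hyphen at line end -> rejoin
    (r' \n ', r''),                                 # break padded by spaces
    (r'([^.])\n', r'\1'),                           # break not after a period
    (r'\(cid:\d{1,2}\)', r''),                      # "(cid:NN)" placeholders
]


def clean_extracted_text(text):
    """Apply each cleanup substitution in turn and return the result."""
    for pattern, replacement in CLEANUP_STEPS:
        text = re.sub(pattern, replacement, text)
    return text


if __name__ == "__main__":
    sample = "multi-\nthreaded PDF\nconversion.\nNext (cid:12)line\n"
    print(repr(clean_extracted_text(sample)))
    # → 'multithreaded PDF conversion.\nNext line'
```

Note that only line breaks after a period survive, so headings and list items without a trailing period get merged into the following text.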
Cannot locate objid=3886. I cannot pinpoint where the exception occurs.
Running "python3 main.py" or "python main.py" prints:
正在处理: aa.pdf
完成
Even though the output says it finished (完成, "done"), I cannot find any Word file in the words folder...
I also edited config.cfg as follows:
[default]
pdf_folder=/home/ffz/Projects/pdfs
word_folder=/home/ffz/Projects/words
max_worker=5
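When a run reports success but no output appears, one thing worth checking is whether the folders named in config.cfg actually exist. The sketch below parses a config with the same keys using the standard `configparser` module and reports or creates missing folders. How main.py itself reads the file is not shown here, so `load_settings` and `check_folders` are hypothetical helpers, not the project's code.

```python
import configparser
import os
import tempfile

SAMPLE_CFG = """
[default]
pdf_folder=/home/ffz/Projects/pdfs
word_folder=/home/ffz/Projects/words
max_worker=5
"""


def load_settings(cfg_text):
    """Parse the [default] section into a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    section = parser["default"]
    return {
        "pdf_folder": section["pdf_folder"],
        "word_folder": section["word_folder"],
        "max_worker": section.getint("max_worker"),
    }


def check_folders(settings, create_output=True):
    """Return a list of problems; optionally create the output folder."""
    problems = []
    if not os.path.isdir(settings["pdf_folder"]):
        problems.append("pdf_folder does not exist: " + settings["pdf_folder"])
    if not os.path.isdir(settings["word_folder"]):
        if create_output:
            os.makedirs(settings["word_folder"], exist_ok=True)
        else:
            problems.append("word_folder does not exist: " + settings["word_folder"])
    return problems


if __name__ == "__main__":
    print(load_settings(SAMPLE_CFG))
    # Demonstrate the folder check inside a throwaway directory.
    with tempfile.TemporaryDirectory() as root:
        demo = {"pdf_folder": os.path.join(root, "pdfs"),
                "word_folder": os.path.join(root, "words"),
                "max_worker": 5}
        print(check_folders(demo))
```

Using absolute paths in config.cfg, as in the post above, avoids surprises from the current working directory.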
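Windows does not seem to support source, so I skipped the lines after the source step, edited the config, and ran the program directly. That is when I hit the problem in the title.
process_pdf cannot be imported. How do I fix this?
My PDF contains images, but after conversion the Word document has no images. How should I handle this?
Scanned PDFs cannot be converted to Word.
Hi, I converted a few technical papers from arXiv and found that the images are gone and some formulas come out garbled, so these are probably not supported. Could this part be improved?
Which version of pdfminer is this using? Thanks.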