Comments (20)
If possible, could you share the URL of your PDF?
How about opening it with the guess=False option, like this:
read_pdf_table("TAJ.pdf", guess=False)
from tabula-py.
Please find the link here:
http://datasheets.avx.com/TAJ.pdf
from tabula-py.
read_pdf_table("TAJ.pdf", guess=False) did not work for me. Is there another way to give the PDF as input? Am I running tabula correctly? Here is my script:
#!/usr/bin/python
#!/usr/bin/perl
#!/usr/bin/perl -d:ptkdb
import fileinput, sys, os, subprocess, io
from tabula import read_pdf_table
df = read_pdf_table("TAJ.pdf")
from tabula-py.
In the current version, tabula-py doesn't handle Java exceptions well. tabula-py depends on tabula-java, so I tried reading the file with tabula-java directly and found that the PDF is too complex for tabula-java to parse in full.
I got the following error with tabula-java:
$ java -jar tabula-0.9.1-jar-with-dependencies.jar -g TAJ.pdf
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Nov 05, 2016 7:25:56 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Nov 05, 2016 7:25:57 PM org.apache.pdfbox.pdmodel.graphics.color.PDSeparation createColorModel
INFO: About to create ColorModel for ICCBased{numberOfComponents: 3}
Nov 05, 2016 7:25:57 PM org.apache.pdfbox.util.operator.pagedrawer.Invoke process
SEVERE: java.lang.IllegalArgumentException: Map size (0) must be >= 1
java.lang.IllegalArgumentException: Map size (0) must be >= 1
at java.awt.image.IndexColorModel.<init>(IndexColorModel.java:335)
at java.awt.image.IndexColorModel.<init>(IndexColorModel.java:287)
at org.apache.pdfbox.pdmodel.graphics.color.PDIndexed.createColorModel(PDIndexed.java:169)
at org.apache.pdfbox.pdmodel.graphics.color.PDIndexed.createColorModel(PDIndexed.java:146)
at org.apache.pdfbox.pdmodel.graphics.xobject.PDCcitt.getRGBImage(PDCcitt.java:189)
at org.apache.pdfbox.util.operator.pagedrawer.Invoke.process(Invoke.java:96)
at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:562)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:269)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:236)
at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:216)
at technology.tabula.ObjectExtractor.drawPage(ObjectExtractor.java:153)
at technology.tabula.ObjectExtractor.extractPage(ObjectExtractor.java:108)
at technology.tabula.PageIterator.next(PageIterator.java:29)
at technology.tabula.CommandLineApp.extractTables(CommandLineApp.java:144)
at technology.tabula.CommandLineApp.main(CommandLineApp.java:60)
Nov 05, 2016 7:25:57 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: BDC
Nov 05, 2016 7:25:57 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: EMC
Nov 05, 2016 7:25:57 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
WARNING: java.lang.NullPointerException
java.lang.NullPointerException
at technology.tabula.ObjectExtractor$PointComparator.compare(ObjectExtractor.java:410)
at technology.tabula.ObjectExtractor.strokeOrFillPath(ObjectExtractor.java:254)
at technology.tabula.ObjectExtractor.strokePath(ObjectExtractor.java:275)
at org.apache.pdfbox.util.operator.pagedrawer.StrokePath.process(StrokePath.java:47)
at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:562)
at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:537)
at org.apache.pdfbox.util.operator.CloseAndStrokePath.process(CloseAndStrokePath.java:45)
at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:562)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:269)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:236)
at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:216)
at technology.tabula.ObjectExtractor.drawPage(ObjectExtractor.java:153)
at technology.tabula.ObjectExtractor.extractPage(ObjectExtractor.java:108)
at technology.tabula.PageIterator.next(PageIterator.java:29)
at technology.tabula.CommandLineApp.extractTables(CommandLineApp.java:144)
at technology.tabula.CommandLineApp.main(CommandLineApp.java:60)
Nov 05, 2016 7:25:57 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Nov 05, 2016 7:25:58 PM org.apache.pdfbox.pdmodel.graphics.color.PDSeparation createColorModel
INFO: About to create ColorModel for ICCBased{numberOfComponents: 3}
Nov 05, 2016 7:25:58 PM org.apache.pdfbox.util.operator.pagedrawer.Invoke process
SEVERE: java.lang.IllegalArgumentException: Map size (0) must be >= 1
java.lang.IllegalArgumentException: Map size (0) must be >= 1
at java.awt.image.IndexColorModel.<init>(IndexColorModel.java:335)
at java.awt.image.IndexColorModel.<init>(IndexColorModel.java:287)
at org.apache.pdfbox.pdmodel.graphics.color.PDIndexed.createColorModel(PDIndexed.java:169)
at org.apache.pdfbox.pdmodel.graphics.color.PDIndexed.createColorModel(PDIndexed.java:146)
at org.apache.pdfbox.pdmodel.graphics.xobject.PDCcitt.getRGBImage(PDCcitt.java:189)
at org.apache.pdfbox.util.operator.pagedrawer.Invoke.process(Invoke.java:96)
at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:562)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:269)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:236)
at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:216)
at org.apache.pdfbox.pdfviewer.PageDrawer.drawPage(PageDrawer.java:139)
at org.apache.pdfbox.pdmodel.PDPage.convertToImage(PDPage.java:801)
at technology.tabula.detectors.NurminenDetectionAlgorithm.detect(NurminenDetectionAlgorithm.java:93)
at technology.tabula.CommandLineApp.extractTables(CommandLineApp.java:161)
at technology.tabula.CommandLineApp.main(CommandLineApp.java:60)
Nov 05, 2016 7:25:58 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: BDC
Nov 05, 2016 7:25:58 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: EMC
Nov 05, 2016 7:25:58 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Nov 05, 2016 7:25:58 PM org.apache.pdfbox.pdmodel.graphics.color.PDSeparation createColorModel
INFO: About to create ColorModel for ICCBased{numberOfComponents: 3}
Nov 05, 2016 7:25:58 PM org.apache.pdfbox.util.operator.pagedrawer.Invoke process
SEVERE: java.lang.IllegalArgumentException: Map size (0) must be >= 1
java.lang.IllegalArgumentException: Map size (0) must be >= 1
at java.awt.image.IndexColorModel.<init>(IndexColorModel.java:335)
at java.awt.image.IndexColorModel.<init>(IndexColorModel.java:287)
at org.apache.pdfbox.pdmodel.graphics.color.PDIndexed.createColorModel(PDIndexed.java:169)
at org.apache.pdfbox.pdmodel.graphics.color.PDIndexed.createColorModel(PDIndexed.java:146)
at org.apache.pdfbox.pdmodel.graphics.xobject.PDCcitt.getRGBImage(PDCcitt.java:189)
at org.apache.pdfbox.util.operator.pagedrawer.Invoke.process(Invoke.java:96)
at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:562)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:269)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:236)
at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:216)
at org.apache.pdfbox.pdfviewer.PageDrawer.drawPage(PageDrawer.java:139)
at org.apache.pdfbox.pdmodel.PDPage.convertToImage(PDPage.java:801)
at technology.tabula.detectors.NurminenDetectionAlgorithm.detect(NurminenDetectionAlgorithm.java:103)
at technology.tabula.CommandLineApp.extractTables(CommandLineApp.java:161)
at technology.tabula.CommandLineApp.main(CommandLineApp.java:60)
Nov 05, 2016 7:25:58 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: BDC
Nov 05, 2016 7:25:59 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: EMC
"",Code,EIA S Min.Code,EIA Metric,L±0.20 (0.008),,W+0.20 (0.008) -0.10 (0.004),,H+0.20 (0.008) -0.10 (0.004),,W1±0.20 (0.008),,A+0.30 (0.012)-0.20 (0.008),,,
"",A,1206,3216-18,3.20 (0.126),,1.60 (0.063),,1.60 (0.063),,1.20 (0.047),,0.80 (0.031),,,1.10 (0.043)
"",B,1210,3528-21,3.50 (0.138),,2.80 (0.110),,1.90 (0.075),,2.20 (0.087),,0.80 (0.031),,,1.40 (0.055)
"",C,2312,6032-28,6.00 (0.236),,3.20 (0.126),,2.60 (0.102),,2.20 (0.087),,1.30 (0.051),,,2.90 (0.114)
MARKING,D,2917,7343-31,7.30 (0.287),,4.30 (0.169),,2.90 (0.114),,2.40 (0.094),,1.30 (0.051),,,4.40 (0.173)
"",E,2917,7343-43,7.30 (0.287),,4.30 (0.169),,4.10 (0.162),,2.40 (0.094),,1.30 (0.051),,,4.40 (0.173)
"A, B, C, D, E, U, V CASE",,,,,,,,,,,,,,,
"",U,2924,7361-43,7.30 (0.287),,6.10 (0.240),,4.10 (0.162),,3.10 (0.120),,1.30 (0.051),,,4.40 (0.173)
AVX LOGO Capacitance Value in pF,,,,,,,,,,,,,,,
...snip...
XXXXX ID Code,,
HOW TO ORDER,,
TAJ C 106,M 035 R NJ,—
Type Case Size Capacitance Code,Tolerance Rated DC Voltage Packaging Specification,Additional
See table pF code: 1st two,"K = ±10% 002 = 2.5Vdc R = Pure Tin 7"" Reel Suffix",characters may be
above digits represent,"M = ±20% 004 = 4Vdc S = Pure Tin 13"" Reel NJ = Standard",added for special
significant figures,"006 = 6.3Vdc A = Gold Plating 7"" Reel Suffix",requirements
3rd digit represents,"010 = 10Vdc B = Gold Plating 13"" Reel",V = Dry pack Option(selected codes only)
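The run above mixes Java log lines (INFO, SEVERE, WARNING) with the CSV rows that tabula-java still manages to produce. A minimal sketch of keeping the two streams apart when calling a child process from Python, assuming the log goes to stderr and the CSV to stdout; `run_separated` is a hypothetical helper name, and `sys.executable` stands in for `java` so the example runs without a JVM:

```python
import subprocess
import sys


def run_separated(cmd):
    """Run cmd and return (exit_code, stdout_text, stderr_text)."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return (
        proc.returncode,
        proc.stdout.decode("utf-8", errors="replace"),
        proc.stderr.decode("utf-8", errors="replace"),
    )


# Stand-in child: writes one CSV row to stdout and one log line to stderr.
demo_cmd = [
    sys.executable,
    "-c",
    "import sys; print('a,b'); print('SEVERE: example', file=sys.stderr)",
]
code, out, err = run_separated(demo_cmd)
```

With the streams separated, the captured stderr text can be inspected or logged on its own, and only the stdout text needs to be parsed as CSV.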
One thing you should try is opening with the page option.
With tabula-java, I can extract with the -p option as follows:
$ java -jar tabula-0.9.1-jar-with-dependencies.jar -g -p 3-6 TAJ.pdf
Nov 05, 2016 7:27:30 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Nov 05, 2016 7:27:31 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Nov 05, 2016 7:27:33 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Nov 05, 2016 7:27:34 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Nov 05, 2016 7:27:35 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Nov 05, 2016 7:27:36 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Nov 05, 2016 7:27:38 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Nov 05, 2016 7:27:38 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Nov 05, 2016 7:27:39 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
"",,,,,,,,,ESR,,
AVX,Case,Capacitance,Rated,Rated,Category Category,,DCL,DF,Max.,,100kHz RMS Current (mA)
Part No.,Size,(μF),Voltage,Temperature,Voltage Temperature,,Max.,Max.,@ 100kHz,MSL,
"",,,(V),(oC),(V) (oC),,(μA),(%),(Ω),,25oC 85oC 125oC
"",,,,,2.5 Volt @ 85°C,,,,,,
TAJA336*002#NJ,A,33,2.5,85,1.7 125,,0.8,8,1.7,1,210 189 84
TAJA476*002#NJ,A,47,2.5,85,1.7 125,,0.9,6,3,1,158 142 63
TAJA686*002#NJ,A,68,2.5,85,1.7 125,,1.4,8,1.5,1,224 201 89
TAJA107*002#NJ,A,100,2.5,85,1.7 125,,2.5,30,1.4,1,231 208 93
...snip...
TAJE156*050#NJ,E,15,50,85,33 125,,7.5,6,0.6,11),524 472 210
TAJV156*050#NJ,V,15,50,85,33 125,,7.5,6,0.6,11),645 581 258
TAJV226*050#NJ,V,22,50,85,33 125,,11,8,0.6,11),645 581 258
Unfortunately, the current version of tabula-py has a restriction (#2): it cannot handle multiple tables at once. With the same options that worked for tabula-java, I got the same error you did.
In [3]: from tabula import read_pdf_table
In [4]: df = read_pdf_table("./TAJ.pdf", guess=False, pages="3-6")
---------------------------------------------------------------------------
CParserError Traceback (most recent call last)
<ipython-input-4-1ebe490eea51> in <module>()
----> 1 df = read_pdf_table("./TAJ.pdf", guess=False, pages="3-6")
/Users/ariga/.virtualenvs/ibis/lib/python3.5/site-packages/tabula/wrapper.py in read_pdf_table(input_path, options, pages, guess, area, spreadsheet, password, nospreadsheet, silent)
79 return
80
---> 81 return pd.read_csv(io.BytesIO(output))
/Users/ariga/.virtualenvs/ibis/lib/python3.5/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
643 skip_blank_lines=skip_blank_lines)
644
--> 645 return _read(filepath_or_buffer, kwds)
646
647 parser_f.__name__ = name
/Users/ariga/.virtualenvs/ibis/lib/python3.5/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
398 return parser
399
--> 400 data = parser.read()
401 parser.close()
402 return data
/Users/ariga/.virtualenvs/ibis/lib/python3.5/site-packages/pandas/io/parsers.py in read(self, nrows)
936 raise ValueError('skipfooter not supported for iteration')
937
--> 938 ret = self._engine.read(nrows)
939
940 if self.options.get('as_recarray'):
/Users/ariga/.virtualenvs/ibis/lib/python3.5/site-packages/pandas/io/parsers.py in read(self, nrows)
1503 def read(self, nrows=None):
1504 try:
-> 1505 data = self._reader.read(nrows)
1506 except StopIteration:
1507 if self._first_chunk:
pandas/parser.pyx in pandas.parser.TextReader.read (pandas/parser.c:9884)()
pandas/parser.pyx in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10142)()
pandas/parser.pyx in pandas.parser.TextReader._read_rows (pandas/parser.c:10870)()
pandas/parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:10741)()
pandas/parser.pyx in pandas.parser.raise_parser_error (pandas/parser.c:25878)()
CParserError: Error tokenizing data. C error: Expected 2 fields in line 5, saw 14
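The final CParserError ("Expected 2 fields in line 5, saw 14") is consistent with rows of different widths, that is, several tables, arriving in one CSV stream. As a rough sketch only, and not something tabula-py does itself, the raw CSV text could be split into blocks of equal width before parsing. `split_by_field_count` is a hypothetical helper; the first two sample rows are taken from the output earlier in this thread and the last two are made up:

```python
import csv
import io


def split_by_field_count(csv_text):
    """Group consecutive CSV rows that share a field count into separate tables."""
    tables = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        if tables and len(tables[-1][0]) == len(row):
            tables[-1].append(row)  # same width as the current table
        else:
            tables.append([row])  # width changed: start a new table
    return tables


sample = (
    "XXXXX ID Code,,\n"
    "HOW TO ORDER,,\n"
    "a,b,c,d\n"
    "1,2,3,4\n"
)
tables = split_by_field_count(sample)
```

This heuristic assumes that a change in column count marks a table boundary, which will not hold for every PDF; restricting pages and areas is still the more reliable route.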
from tabula-py.
Where should I pass the -p option in tabula-py?
from tabula-py.
Can you please also let me know what I need to install to run tabula-java, and which command prompt you are running the command below from? I have Cygwin installed on Windows.
java -jar tabula-0.9.1-jar-with-dependencies.jar -g -p 3-6 TAJ.pdf
from tabula-py.
Requesting input from anyone.
from tabula-py.
Have you seen the tabula-java repo?
https://github.com/tabulapdf/tabula-java
You can download it from here:
https://github.com/tabulapdf/tabula-java/releases
from tabula-py.
Thanks for the link. What are the steps to install this package?
from tabula-py.
Can someone please tell me the difference between stream mode and lattice mode in Tabula, and in which cases lattice mode should be applied?
from tabula-py.
I made a workaround you can use.
I placed the jar file in the project root and used subprocess.call to run it.
This works pretty well for me. It prints some errors and warnings, which is why stderr is sent to devnull, but it works. Here's my method:
def convert_pdf_to_csv(self, file):
    # Build the output path by replacing "pdf" in the input path with the output format.
    outfile = file.replace("pdf", self.OUTPUT_FORMAT)
    # Run tabula-java; stderr is merged into stdout and both are discarded.
    subprocess.call(['java', '-jar', self.JARFILE, '-o', outfile, '-f', self.OUTPUT_FORMAT, '-g', '-r', file], stdout=open(os.devnull, 'w'), stderr=subprocess.STDOUT)
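The method above builds the output name with file.replace("pdf", ...), which rewrites every "pdf" in the path, and it opens os.devnull without closing it. A standalone sketch that avoids both, using the same flags as the method above (-o, -f, -g, -r). The names `build_tabula_command` and `convert`, the jar name "tabula.jar", and the "CSV" default are illustrative assumptions; check the tabula-java README for the values -f accepts:

```python
import os
import subprocess


def build_tabula_command(jar_path, pdf_path, output_format):
    """Return (command, outfile) for a tabula-java run with the flags used above."""
    base, _ext = os.path.splitext(pdf_path)  # only the extension is swapped
    outfile = base + "." + output_format.lower()
    command = [
        "java", "-jar", os.path.abspath(jar_path),
        "-o", outfile, "-f", output_format, "-g", "-r", pdf_path,
    ]
    return command, outfile


def convert(jar_path, pdf_path, output_format="CSV"):
    """Run tabula-java and return the output path (requires Java and the jar)."""
    command, outfile = build_tabula_command(jar_path, pdf_path, output_format)
    # subprocess.DEVNULL discards output without leaving a file handle open.
    subprocess.call(command, stdout=subprocess.DEVNULL, stderr=subprocess.STDOUT)
    return outfile


# Building the command does not need Java, so it can be inspected safely.
command, outfile = build_tabula_command("tabula.jar", "pdf_reports/TAJ.pdf", "CSV")
```

For an input such as "pdf_reports/TAJ.pdf", the str.replace approach would also rename the directory part of the path, whereas os.path.splitext touches only the extension.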
from tabula-py.
Thanks for the reply. Can you please explain a bit more what is happening in subprocess.call(['java', '-jar', self.JARFILE, '-o', ...)?
from tabula-py.
See https://github.com/tabulapdf/tabula-java
I'm pretty new to Python, but as far as I know, subprocess.call just lets you run other processes from within Python.
self.JARFILE is the absolute path of "tabula-0.9.1-jar-with-dependencies.jar", because for some reason it wouldn't work with a relative path.
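To illustrate the two points above, here is a small sketch of subprocess.call running another program and returning its exit code, plus turning a relative jar path into an absolute one. `sys.executable` stands in for `java` so the example runs without a JVM, and the jar is assumed (not required) to sit in the current directory:

```python
import os
import subprocess
import sys

# subprocess.call runs the command, waits for it to finish,
# and returns the child's exit code.
exit_code = subprocess.call([sys.executable, "-c", "print('hello from a child process')"])

# A non-zero exit code is how a child process signals failure.
failed_code = subprocess.call([sys.executable, "-c", "import sys; sys.exit(3)"])

# Resolve a relative path against the current working directory
# before handing it to the child process.
jar_path = os.path.abspath("tabula-0.9.1-jar-with-dependencies.jar")
```

Resolving the path up front means the call no longer depends on which directory the script is started from.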
from tabula-py.
Thanks for the reply. What about self.OUTPUT_FORMAT? Is it setting the output format to PDF? Can complex PDFs be converted this way? Also, how can I pass the mode type, stream mode or lattice mode?
from tabula-py.
output_format is just a variable I use to set the output format with the -f option. Please read the link I provided to see which formats you can use for your project. I haven't seen anything about stream or lattice mode.
from tabula-py.
@sidmohan I'm sorry, I introduced a bug and have now fixed it in f1db4ef.
Could you upgrade your tabula-py?
from tabula-py.
OK, thanks. How do you want us to try this?
from tabula-py.
As follows:
df = read_pdf_table("./TAJ.pdf", guess=False, pages="3-6")
Of course, there may still be problems related to multiple tables. You should restrict the pages and areas to extract what you want. For details on the restriction options, read the tabula-java README:
https://github.com/tabulapdf/tabula-java/blob/master/README.md
Alternatively, I think you can ask in the tabula-java Gitter chat room:
https://gitter.im/tabulapdf/tabula-java?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge
from tabula-py.
I am facing a similar issue.
tabula-py works fine in a Jupyter notebook, but when I run the same code from the command line, I get the error "unsupported/disabled operation: BDC".
Please help.
from tabula-py.
@nitinvijay23 This issue has already been closed. Could you create a new issue with more detail?
from tabula-py.