Comments (5)
@xuxoramos Thanks for reporting it.
Can you paste the actual code and the full error message without trimming? I can't reproduce your error on my end.
Also, can you tell me how you installed tabula-py? Please share your pip freeze output. I'm wondering whether you installed it with the jpype option. #371 is a patch for jpype, and jpype is not installed by default. Note that with jpype you cannot change the encoding within a single Python process. To change it, you need to restart the Python process.
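One quick way to see which backend an environment is likely to use is to check whether the jpype module is importable. This is only a sketch: the helper name jpype_available is hypothetical and not part of tabula-py, and it relies on the standard library alone.

```python
import importlib.util


def jpype_available() -> bool:
    """Return True if the top-level `jpype` module can be found."""
    # find_spec returns None when the module is not installed.
    return importlib.util.find_spec("jpype") is not None


if __name__ == "__main__":
    backend = "jpype" if jpype_available() else "subprocess"
    # Assumption: tabula-py falls back to subprocess when jpype is missing.
    print(f"tabula-py would likely use the {backend} backend here")
```

The pip freeze output remains the more complete answer, since it also shows the installed tabula-py version.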
Here is my result. I parsed the PDF you provided, and no error occurred.
>>> import tabula
>>> tabula.read_pdf("tmp.pdf", java_options="-Dfile.encoding=ISO-8859-1", pandas_options={"encoding":"ISO-8859-1"}, encoding="ISO-8859-1", pages="all")
[ Activity
0 Activity Code Name and Definition Code
1 002 Self-Service AJCC Employment and Workforce...
2 NaN
3 This activity is system generated when an indi...
4 workforce information available in CalJOBS. Wo...
5 as: local performance, availability of support...
6 compensation, and performance and program cost...
7 NaN
...snip...
from tabula-py.
This is the entire error output:
java_options is ignored until rebooting the Python process.
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
Cell In[14], line 1
----> 1 dfs = tb.read_pdf("../sourcedata/caljobs_activity_codes_dictionary.pdf",
2 pages="all",
3 encoding="windows-1252",
4 pandas_options={"encoding":"windows-1252"},
5 java_options=["-Dfile.encoding=windows-1252"]
6 )
File C:\ProgramData\anaconda3\Lib\site-packages\tabula\io.py:395, in read_pdf(input_path, output_format, encoding, java_options, pandas_options, multiple_tables, user_agent, use_raw_url, pages, guess, area, relative_area, lattice, stream, password, silent, columns, relative_columns, format, batch, output_path, force_subprocess, options)
392 raise ValueError(f"{path} is empty. Check the file, or download it manually.")
394 try:
--> 395 output = _run(
396 tabula_options,
397 java_options,
398 path,
399 encoding=encoding,
400 force_subprocess=force_subprocess,
401 )
402 finally:
403 if temporary:
File C:\ProgramData\anaconda3\Lib\site-packages\tabula\io.py:82, in _run(options, java_options, path, encoding, force_subprocess)
79 elif set(java_options) - IGNORED_JAVA_OPTIONS:
80 logger.warning("java_options is ignored until rebooting the Python process.")
---> 82 return _tabula_vm.call_tabula_java(options, path)
File C:\ProgramData\anaconda3\Lib\site-packages\tabula\backend.py:117, in SubprocessTabula.call_tabula_java(self, options, path)
115 if result.stderr:
116 logger.warning(f"Got stderr: {result.stderr.decode(self.encoding)}")
--> 117 return result.stdout.decode(self.encoding)
118 except FileNotFoundError:
119 raise JavaNotFoundError(JAVA_NOT_FOUND_ERROR)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 3978: invalid start byte
I installed tabula-py from the wheel file into an Anaconda setup (hence no requirements.txt, only a ton of dependencies) on a Windows 10 machine that cannot access the internet (for security reasons), and hence without jpype. So I guess the best way to emulate this environment would be to completely shut down your network interface and attempt to replicate the error.
from tabula-py.
Hmm, that sounds weird. As far as I can see, the latest version on conda-forge is still v2.7.0. https://anaconda.org/conda-forge/tabula-py
Anyway, your log shows that you are using the subprocess backend, not jpype. Hence, #371 is unrelated because it is a jpype-related issue.
Also, I tried Jupyter and IPython on my Windows machine, but I can't reproduce the issue.
In [1]: import tabula
...:
...: tabula.read_pdf("tmp.pdf", pages="all", encoding="windows-1252", pandas_options={"encoding":"windows-1252"}, java_options=["-Dfile.encoding=windows-1252"])
Error importing jpype dependencies. Fallback to subprocess.
No module named 'jpype'
Out[1]:
[ Activity
0 Activity Code Name and Definition Code
1 002 Self-Service AJCC Employment and Workforce...
2 NaN
...snip...
Does it happen right after launching Jupyter/IPython? I guess you changed the encoding within the same Python process, since the error shows:
File C:\ProgramData\anaconda3\Lib\site-packages\tabula\backend.py:117, in SubprocessTabula.call_tabula_java(self, options, path)
115 if result.stderr:
116 logger.warning(f"Got stderr: {result.stderr.decode(self.encoding)}")
--> 117 return result.stdout.decode(self.encoding)
118 except FileNotFoundError:
119 raise JavaNotFoundError(JAVA_NOT_FOUND_ERROR)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 3978: invalid start byte
This suggests that result.stdout.decode(self.encoding) raises the error, i.e., it is trying to decode with utf-8. Your log shows the cell number is In[14], so I suspect you set utf-8 initially and then changed to windows-1252.
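The mismatch can be reproduced without tabula-py. The snippet below is a minimal illustration using a made-up byte string (not taken from the PDF) that contains byte 0x92, which is a right single quotation mark in windows-1252 but cannot start a UTF-8 sequence.

```python
# Hypothetical sample: "It's" with a curly apostrophe encoded as windows-1252.
raw = b"It\x92s"

utf8_error = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # Same kind of failure as in the traceback above.
    utf8_error = str(exc)
    print(utf8_error)

# Decoding with the encoding the bytes were written in succeeds.
decoded = raw.decode("windows-1252")
print(decoded)
```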
Since jpype support was added to tabula-py, tabula doesn't allow changing the encoding argument after the first read_xxx call. If you want to change it, you can pass the force_subprocess=True option, which recreates the SubprocessTabula instance.
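As a sketch of that suggestion, the code below passes force_subprocess=True together with the new encoding. The parameter names come from the read_pdf signature shown in the traceback above. The helper names build_read_kwargs and read_with_encoding are hypothetical, and the exact behavior may depend on the installed tabula-py version.

```python
def build_read_kwargs(encoding: str) -> dict:
    """Collect read_pdf keyword arguments that all agree on one encoding."""
    return {
        "pages": "all",
        "encoding": encoding,
        "pandas_options": {"encoding": encoding},
        "java_options": [f"-Dfile.encoding={encoding}"],
        # Per the comment above, this recreates the SubprocessTabula instance.
        "force_subprocess": True,
    }


def read_with_encoding(pdf_path: str, encoding: str):
    """Call tabula.read_pdf with a consistent encoding (needs tabula-py and Java)."""
    import tabula  # imported lazily so the helper above works without it

    return tabula.read_pdf(pdf_path, **build_read_kwargs(encoding))
```

For example, read_with_encoding("tmp.pdf", "windows-1252") would be the call to try in a session where an earlier read used utf-8.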
from tabula-py.
I made a potential mitigation in #378. Please try the code on the master branch and give me feedback if you have any.
from tabula-py.
Released https://pypi.org/manage/project/tabula-py/release/2.9.1/
from tabula-py.