Comments (11)
Thanks for the suggestion. Initially I considered using py4j, but at that time tabula-java development was active, so I shelved the idea for maintainability reasons.
I think tweaking the io._run function and TabulaOption could help to implement it.
Lines 58 to 102 in b24e3bd
Alternatively, we may be able to do something similar to tabula-rs: tabulapdf/tabula-java#444
Anyway, I'm looking forward to a contribution for this change.
from tabula-py.
Thanks for the considerations!
> Initially I thought to use py4j
I think py4j uses a remote tunnel to operate the JVM, which results in a data transfer penalty, while JPype has a native interface with a shared-memory approach.
> but at that time, tabula-java development was active, so I pended the idea for maintainability.
If you call into the CLI rather than using the actual API, you get the same interface as with subprocess (i.e. no maintainability impact), only in a more performant way.
tabulapdf/tabula-java#444 looks like a nice enhancement that would help skip an unnecessary layer of string serialization.
> Anyway, I'm looking forward to contribution for the change.
Disclaimer: Unfortunately, I'm fairly busy with some other projects (pypdfium2 & co.) and thus don't intend to work on this myself.
from tabula-py.
@mara004 Thanks for your suggestion. However, I can't reproduce the issue on my side, and I'm hesitant to raise it on their behalf since I don't know what triggers it.
I released v2.8.2, which automatically falls back to subprocess if there is any import error with jpype.
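For readers following along, the fallback idea can be sketched roughly as below. This is not the actual tabula-py code: the function name `pick_backend` and the returned labels are hypothetical, and the sketch only shows the general "try the import, otherwise use subprocess" pattern.

```python
import importlib


def pick_backend(module_name="jpype"):
    """Return which backend to use (hypothetical helper, not tabula-py API).

    If the optional module cannot be imported for any reason covered by
    ImportError (including ModuleNotFoundError), fall back to subprocess.
    """
    try:
        importlib.import_module(module_name)
    except ImportError:
        return "subprocess"
    return "in_process"


if __name__ == "__main__":
    # Prints "in_process" if jpype is importable, otherwise "subprocess".
    print(pick_backend())
```

The point of the pattern is that a broken or missing jpype installation degrades to the slower path instead of raising at import time.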
from tabula-py.
I tried jpype in #355.
However, I found that when I run pytest with multiple test files, it always fails.
https://github.com/chezou/tabula-py/actions/runs/5987819184
This is a huge blocker for introducing jpype, and I'm about to give up.
from tabula-py.
I'm still a bit tired, but all the test suite failures seem to be caused by TypeError: _run() takes from 2 to 3 positional arguments but 4 were given.
That's because the encoding argument was removed. Assuming that is correct, just adapt the calling code accordingly. Please don't give up just yet and at least leave the PR open as draft so it is more visible for others to take a look.
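To illustrate the kind of mismatch described here, a minimal sketch follows. The `_run` below is a hypothetical stand-in with two required parameters and one optional one (the parameter names are assumptions, not taken from the PR); calling it with a fourth positional argument, as an old caller still passing an encoding would, produces the quoted TypeError.

```python
def _run(java_options, options, path=None):
    """Hypothetical stand-in for a function whose encoding parameter was removed."""
    return (java_options, options, path)


def call_old_style():
    """Call _run the old way, with a trailing encoding argument."""
    try:
        _run([], {}, "example.pdf", "utf-8")
    except TypeError as exc:
        return str(exc)
    return None


def call_new_style():
    """Call _run with the encoding argument dropped."""
    return _run([], {}, "example.pdf")


if __name__ == "__main__":
    print(call_old_style())
    print(call_new_style())
```

Dropping the extra argument at each call site is the adaptation suggested above.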
from tabula-py.
I guess you're trying the old PR. Can you check the new one? #356
from tabula-py.
I don't currently have the env to test this locally, I can only try to help with concrete questions.
The benchmark you shared looks promising. Any problems left?
from tabula-py.
Ah, sorry for the confusion. I solved the issue by splitting the test files so that they run in separate processes. #356 (comment)
It's okay to merge into master for now.
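As a rough sketch of what running test files in separate processes can look like (this is an assumption about the general approach, not the configuration used in #356; the helper names and file paths are hypothetical), each file gets its own interpreter and therefore its own JVM state:

```python
import subprocess
import sys


def build_commands(test_files):
    """Build one pytest invocation per test file (hypothetical helper)."""
    return [[sys.executable, "-m", "pytest", path] for path in test_files]


def run_isolated(test_files):
    """Run each test file in a fresh Python process; map file -> exit code."""
    results = {}
    for cmd in build_commands(test_files):
        results[cmd[-1]] = subprocess.run(cmd).returncode
    return results


if __name__ == "__main__":
    # Hypothetical file names; replace with the project's real test files.
    print(run_isolated(["tests/test_a.py", "tests/test_b.py"]))
```

This trades some startup time for isolation, which matters when a library keeps process-wide state such as a started JVM.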
from tabula-py.
I found a weird error that is blamed on jpype. I can't reproduce it, so I need some help.
https://stackoverflow.com/questions/77077943/pyspark-tabula-py-read-pdf-error-no-module-named-org-apache-commons/77092171#77092171
https://stackoverflow.com/questions/77089769/tabula-py-java-lang-classnotfoundexception-java-lang-classnotfoundexception-o
from tabula-py.
It is presumably an environment-specific issue with jpype.
https://community.databricks.com/t5/data-engineering/what-is-the-best-way-to-read-in-a-ms-access-accdb-database-into/td-p/3217
I will implement a workaround to enable subprocess as an option.
from tabula-py.
Sorry for the problems, I'm afraid I don't know the cause either.
I'd suggest you ask about this upstream on jpype's discussion page, maybe they have seen this already and know how to fix it.
from tabula-py.