Comments (11)

chezou commented on July 17, 2024

Thanks for the suggestion. Initially I thought to use py4j, but at that time, tabula-java development was active, so I pended the idea for maintainability.

I think tweaking the io._run function and TabulaOption could help implement it.

tabula-py/tabula/io.py

Lines 58 to 102 in b24e3bd

def _run(
    java_options: List[str],
    options: TabulaOption,
    path: Optional[str] = None,
    encoding: str = "utf-8",
) -> bytes:
    """Call tabula-java with the given lists of Java options and tabula-py
    options, as well as an optional path to pass to tabula-java as a regular
    argument and an optional encoding to use for any required output sent to
    stderr.
    tabula-py options are translated into tabula-java options, see
    :func:`build_options` for more information.
    """
    # Workaround to enforce the silent option. See:
    # https://github.com/tabulapdf/tabula-java/issues/231#issuecomment-397281157
    if options.silent:
        java_options.extend(
            (
                "-Dorg.slf4j.simpleLogger.defaultLogLevel=off",
                "-Dorg.apache.commons.logging.Log"
                "=org.apache.commons.logging.impl.NoOpLog",
            )
        )
    args = ["java"] + java_options + ["-jar", _jar_path()] + options.build_option_list()
    if path:
        args.append(path)
    try:
        result = subprocess.run(
            args,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            stdin=subprocess.DEVNULL,
            check=True,
        )
        if result.stderr:
            logger.warning(f"Got stderr: {result.stderr.decode(encoding)}")
        return result.stdout
    except FileNotFoundError:
        raise JavaNotFoundError(JAVA_NOT_FOUND_ERROR)
    except subprocess.CalledProcessError as e:
        logger.error(f"Error from tabula-java:\n{e.stderr.decode(encoding)}\n")
        raise
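
To make the command assembly in the `_run` excerpt easier to follow, here is a minimal standalone sketch of just that step. The names `build_command` and `SILENT_FLAGS`, the JAR path, and the example option values are assumptions for illustration and are not part of tabula-py. The sketch also copies `java_options` instead of extending the caller's list in place, which is a choice made here, not something the original does.

```python
from typing import List, Optional

# JVM flags that the excerpt adds when the silent option is set.
SILENT_FLAGS = (
    "-Dorg.slf4j.simpleLogger.defaultLogLevel=off",
    "-Dorg.apache.commons.logging.Log"
    "=org.apache.commons.logging.impl.NoOpLog",
)


def build_command(
    java_options: List[str],
    tabula_options: List[str],
    jar_path: str,
    path: Optional[str] = None,
    silent: bool = False,
) -> List[str]:
    """Hypothetical helper: assemble the argument list for `java -jar`."""
    # Copy so the caller's list is left untouched.
    jvm_flags = list(java_options)
    if silent:
        jvm_flags.extend(SILENT_FLAGS)
    args = ["java", *jvm_flags, "-jar", jar_path, *tabula_options]
    if path:
        args.append(path)
    return args


# Example with made-up values; no Java process is started here.
caller_options = ["-Xmx256m"]
command = build_command(
    caller_options,
    ["--pages", "1", "--format", "JSON"],
    jar_path="tabula.jar",
    path="example.pdf",
    silent=True,
)
print(command)
```

Whatever backend replaces `subprocess.run`, it would presumably still need the same option list, which is why keeping this step separate from process launching may make the switch easier.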

Or we may be able to do something similar to tabula-rs: tabulapdf/tabula-java#444

Anyway, I'm looking forward to contribution for the change.

from tabula-py.

mara004 commented on July 17, 2024

Thanks for the considerations!

Initially I thought to use py4j

I think py4j uses a remote tunnel to operate the JVM, which results in a data transfer penalty, while JPype has a native interface with a shared-memory approach.

but at that time, tabula-java development was active, so I pended the idea for maintainability.

If you call into the CLI rather than using the actual API, you get the same interface as with subprocess (i.e. no maintainability impact), only in a more performant way.
tabulapdf/tabula-java#444 looks like a nice enhancement that would help skip an unnecessary layer of string serialization.

Anyway, I'm looking forward to contribution for the change.

Disclaimer: Unfortunately, I'm fairly busy with some other projects (pypdfium2 & co.) and thus don't intend to work on this myself.

chezou commented on July 17, 2024

@mara004 Thanks for your suggestion. However, I can't reproduce the issue on my side, and I'm hesitant to raise it on their behalf since I don't know what triggers it.

I released v2.8.2, which automatically falls back to subprocess if importing jpype raises any error.
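
The fallback described here can be sketched roughly as follows. This only illustrates the general pattern and is not the code shipped in v2.8.2; the function name `select_backend` and the returned labels are hypothetical.

```python
import importlib


def select_backend(module_name: str = "jpype") -> str:
    """Hypothetical helper: use the in-process backend if the bridge module
    imports cleanly, otherwise fall back to launching Java via subprocess."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so a missing
        # module is covered here too.
        return "subprocess"
    return "in-process"


# The result depends on whether jpype is installed in this environment.
print(select_backend())
```

Checking at import time keeps the decision in one place, so callers need not know which backend ends up being used.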

chezou commented on July 17, 2024

I tried jpype in #355

However, I found that pytest always fails when I run it with multiple test files.
https://github.com/chezou/tabula-py/actions/runs/5987819184

This is a huge blocker for introducing jpype, and I'm about to give up.

mara004 commented on July 17, 2024

I'm still a bit tired, but all the test suite failures seem to be caused by TypeError: _run() takes from 2 to 3 positional arguments but 4 were given.
That's because the encoding argument was removed. Assuming that is correct, just adapt the calling code accordingly. Please don't give up just yet, and at least leave the PR open as a draft so it is more visible and others can take a look.
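
For readers unfamiliar with this kind of failure, the sketch below reproduces the same error message with a stand-in function. The reduced signature is an assumption inferred from the error text ("from 2 to 3 positional arguments"), not copied from the PR, and the stand-in does no real work.

```python
from typing import List, Optional


def _run(java_options: List[str], options: object, path: Optional[str] = None) -> bytes:
    """Stand-in with the assumed reduced signature (no encoding parameter)."""
    return b""


try:
    # Old-style call that still passes an encoding as a fourth positional argument.
    _run([], object(), "example.pdf", "utf-8")
except TypeError as exc:
    message = str(exc)

print(message)

# Adapted call: drop the encoding argument.
fixed = _run([], object(), "example.pdf")
```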

chezou commented on July 17, 2024

I guess you're trying the old PR. Can you check the new one? #356

mara004 commented on July 17, 2024

I don't currently have the env to test this locally, so I can only try to help with concrete questions.
The benchmark you shared looks promising. Any problems left?

chezou commented on July 17, 2024

Ah, sorry for the confusion. I solved the issue by separating test files and processes: #356 (comment)

It's okay to merge into master for now.
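
Separating test files into their own processes can be sketched as below. The thread does not show the actual setup, so the helper `pytest_commands` and the file names are hypothetical. The sketch only builds one `python -m pytest <file>` command per test file, which a CI job or script could then run one after another.

```python
import sys
from typing import Iterable, List


def pytest_commands(test_files: Iterable[str]) -> List[List[str]]:
    """Hypothetical helper: one pytest invocation per test file, so each file
    runs in a fresh interpreter (and a fresh JVM, if one is started)."""
    return [[sys.executable, "-m", "pytest", path] for path in test_files]


commands = pytest_commands(["tests/test_a.py", "tests/test_b.py"])
for cmd in commands:
    # To actually run them: subprocess.run(cmd, check=True)
    print(" ".join(cmd))
```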

chezou commented on July 17, 2024

Found a weird error that is being blamed on jpype. I can't reproduce it, so I need some help.
https://stackoverflow.com/questions/77077943/pyspark-tabula-py-read-pdf-error-no-module-named-org-apache-commons/77092171#77092171
https://stackoverflow.com/questions/77089769/tabula-py-java-lang-classnotfoundexception-java-lang-classnotfoundexception-o

chezou commented on July 17, 2024

It is presumably an environment-specific issue with jpype.
https://community.databricks.com/t5/data-engineering/what-is-the-best-way-to-read-in-a-ms-access-accdb-database-into/td-p/3217

I will implement a workaround to enable subprocess as an option.

mara004 commented on July 17, 2024

Sorry for the problems, I'm afraid I don't know the cause either.
I'd suggest you ask about this upstream on jpype's discussion page; maybe they have seen this already and know how to fix it.
