Comments (5)

chezou commented on July 18, 2024

@xuxoramos Thanks for reporting it.

Can you paste the actual code and full error message without trimming? I can't reproduce your error on my end.

Also, can you tell me how you installed tabula-py? Please share your pip freeze output. I'm wondering whether you installed it with the jpype option or not. #371 is a patch for jpype, and jpype is not installed by default. Note that with jpype you can't change the encoding within a single Python process. To change it, you need to restart the Python process.
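If pip freeze is awkward to share from an offline machine, a quick check of whether the jpype module is importable answers the same question. This is a minimal sketch using only the standard library; the helper name `jpype_available` is hypothetical and not part of tabula-py:

```python
import importlib.util


def jpype_available() -> bool:
    """Return True if the `jpype` module can be located in this environment."""
    # find_spec only looks the module up; it does not import it or start a JVM.
    return importlib.util.find_spec("jpype") is not None


print("jpype importable:", jpype_available())
```

If this prints False, tabula-py should be falling back to the subprocess path rather than jpype.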

Here is my result: I tried to parse the PDF you provided, and no error occurred.

>>> import tabula
>>> tabula.read_pdf("tmp.pdf", java_options="-Dfile.encoding=ISO-8859-1", pandas_options={"encoding":"ISO-8859-1"}, encoding="ISO-8859-1", pages="all")
[                                             Activity
0              Activity Code Name and Definition Code
1   002 Self-Service AJCC Employment and Workforce...
2                                                 NaN
3   This activity is system generated when an indi...
4   workforce information available in CalJOBS. Wo...
5   as: local performance, availability of support...
6   compensation, and performance and program cost...
7                                                 NaN
...snip...

from tabula-py.

xuxoramos commented on July 18, 2024

This is the entire error output:

java_options is ignored until rebooting the Python process.
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[14], line 1
----> 1 dfs = tb.read_pdf("../sourcedata/caljobs_activity_codes_dictionary.pdf", 
      2                   pages="all", 
      3                   encoding="windows-1252", 
      4                   pandas_options={"encoding":"windows-1252"},
      5                   java_options=["-Dfile.encoding=windows-1252"]
      6                  )

File C:\ProgramData\anaconda3\Lib\site-packages\tabula\io.py:395, in read_pdf(input_path, output_format, encoding, java_options, pandas_options, multiple_tables, user_agent, use_raw_url, pages, guess, area, relative_area, lattice, stream, password, silent, columns, relative_columns, format, batch, output_path, force_subprocess, options)
    392     raise ValueError(f"{path} is empty. Check the file, or download it manually.")
    394 try:
--> 395     output = _run(
    396         tabula_options,
    397         java_options,
    398         path,
    399         encoding=encoding,
    400         force_subprocess=force_subprocess,
    401     )
    402 finally:
    403     if temporary:

File C:\ProgramData\anaconda3\Lib\site-packages\tabula\io.py:82, in _run(options, java_options, path, encoding, force_subprocess)
     79 elif set(java_options) - IGNORED_JAVA_OPTIONS:
     80     logger.warning("java_options is ignored until rebooting the Python process.")
---> 82 return _tabula_vm.call_tabula_java(options, path)

File C:\ProgramData\anaconda3\Lib\site-packages\tabula\backend.py:117, in SubprocessTabula.call_tabula_java(self, options, path)
    115     if result.stderr:
    116         logger.warning(f"Got stderr: {result.stderr.decode(self.encoding)}")
--> 117     return result.stdout.decode(self.encoding)
    118 except FileNotFoundError:
    119     raise JavaNotFoundError(JAVA_NOT_FOUND_ERROR)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 3978: invalid start byte

I installed tabula-py from the wheel file into an Anaconda setup (hence no requirements.txt, only a ton of dependencies) on a Windows 10 machine that cannot access the internet (for security reasons), and hence without jpype. So I guess the best way to emulate this environment would be to completely shut down your network interface and attempt to replicate the error.

chezou commented on July 18, 2024

Hmm, that sounds weird. I can find that conda-forge's latest version is still v2.7.0. https://anaconda.org/conda-forge/tabula-py

Anyway, your log shows that you are using the subprocess, not jpype. Hence, #371 is unrelated because it is a jpype-related issue.

Also, I tried Jupyter and ipython on my Windows machine, but I can't reproduce the issue.

In [1]: import tabula
   ...:
   ...: tabula.read_pdf("tmp.pdf", pages="all", encoding="windows-1252", pandas_options={"encoding":"windows-1252"}, java_options=["-Dfile.encoding=windows-1252"])
Error importing jpype dependencies. Fallback to subprocess.
No module named 'jpype'
Out[1]:
[                                             Activity
 0              Activity Code Name and Definition Code
 1   002 Self-Service AJCC Employment and Workforce...
 2                                                 NaN
...snip...

Does it happen just after launching jupyter/ipython? I guess you changed the encoding within the same Python process, since the error shows:

File C:\ProgramData\anaconda3\Lib\site-packages\tabula\backend.py:117, in SubprocessTabula.call_tabula_java(self, options, path)
    115     if result.stderr:
    116         logger.warning(f"Got stderr: {result.stderr.decode(self.encoding)}")
--> 117     return result.stdout.decode(self.encoding)
    118 except FileNotFoundError:
    119     raise JavaNotFoundError(JAVA_NOT_FOUND_ERROR)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 3978: invalid start byte

This suggests that result.stdout.decode(self.encoding) raises the error, i.e., it is trying to decode with utf-8. Your log shows the cell number is In[14], so I suspect you set utf-8 initially and later changed to windows-1252.
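The failure mode itself is easy to reproduce without tabula or Java. A minimal sketch, assuming a made-up sample string (not text from the PDF): in windows-1252 the right single quotation mark is the single byte 0x92, which is not a valid start byte in utf-8.

```python
from typing import Optional


def try_decode(data: bytes, encoding: str) -> Optional[str]:
    """Decode `data` with `encoding`; return None if the bytes are invalid for it."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        print(f"{encoding}: {exc}")
        return None


# Hypothetical sample: U+2019 (right single quotation mark) encodes to 0x92 in windows-1252.
raw = "worker\u2019s".encode("windows-1252")

print(raw)                              # the bytes contain \x92
print(try_decode(raw, "utf-8"))         # fails: 0x92 is an invalid start byte
print(try_decode(raw, "windows-1252"))  # succeeds
```

So the bytes coming back from Java look like windows-1252 output, while the Python side is still decoding them as utf-8.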

Since jpype support was added to tabula-py, tabula doesn't allow changing the encoding argument after the first read_xxx call. If you want to change it, you can pass the force_subprocess=True option, which recreates the SubprocessTabula instance.
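A rough sketch of what that call could look like, reusing the arguments from the report above. The helper names `build_read_kwargs` and `read_with_encoding` are hypothetical; force_subprocess is taken from the read_pdf signature shown in the traceback, so this assumes a tabula-py version that has it:

```python
def build_read_kwargs(encoding: str) -> dict:
    """Collect read_pdf keyword arguments that all use the same encoding."""
    return {
        "pages": "all",
        "encoding": encoding,
        "pandas_options": {"encoding": encoding},
        "java_options": [f"-Dfile.encoding={encoding}"],
        # Recreate the subprocess backend so the new encoding is picked up.
        "force_subprocess": True,
    }


def read_with_encoding(path: str, encoding: str):
    """Call tabula.read_pdf with a consistent encoding (requires tabula-py and Java)."""
    import tabula  # imported lazily so build_read_kwargs works without tabula-py installed

    return tabula.read_pdf(path, **build_read_kwargs(encoding))


# Example (the path is a placeholder):
# dfs = read_with_encoding("tmp.pdf", "windows-1252")
```

Restarting the kernel and passing the desired encoding on the very first call should have the same effect.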

chezou commented on July 18, 2024

Made a potential mitigation in #378. Please try the master branch code and give me feedback, if any.

chezou commented on July 18, 2024

Released https://pypi.org/manage/project/tabula-py/release/2.9.1/
