Unable to detect table with longer header information about tabula-py HOT 4 CLOSED

dstone42 commented on July 17, 2024

Unable to detect table with longer header information

from tabula-py.

Comments (4)

dstone42 commented on July 17, 2024

I'm sorry I wasn't very clear about the PDF. There was no place to put a note next to the PDF link, so I put it right after the link, in the next section. I will include the page number next time as well.

I will try the stream option. I don't understand exactly what it's doing, but I'll let you know if that helps. That last table looks like exactly what I want.

Thanks!

dstone42 commented on July 17, 2024

I tried using the stream=True option for the whole pdf (meaning I used pages='all'), and it seems to work for the other tables as well. I haven't checked all 557, but with some decent spot checking, it looks like it worked. Thanks for your help!

chezou commented on July 17, 2024

@dstone42 Next time, could you point to a specific PDF and page?

Looking at the result you've shared, it picks up the first 7 lines as a separate table. This comes from tabula-java's table detection algorithm, and the only way to avoid it would be to set the area option. PDFs generally carry no table markup, so some detection failures are to be expected.

>>> dfs = tabula.read_pdf("HINTS 5 Cycle 4 Public Codebook.pdf", pages=27)
>>> dfs[0]
  IGHSPANLI: High linguistically isolated strata (‘Census tracts in which 30% of the households have no adults over the age of 14 that report speaking English
0                                         ery well’)
1                           ariable Name: HIGHSPANLI
2  ariable Label: High linguistically isolated st...
3                           ariable Format: HIGHSPAN
4                   riteria to receive Question: N/A
5                           riteria description: N/A
6                           ack to Table of Contents
7                                                NaN
8                                                NaN
9                                                NaN
>>> dfs[1]
   HIGHSPANLI Value\rLabel Unweighted\rSample\rSize  ... Unnamed: 11 Unnamed: 12 Unnamed: 13
0         NaN          NaN                      NaN  ...         NaN         NaN         NaN
1         NaN          NaN                      NaN  ...         NaN         NaN         NaN
2         NaN   HIGHSPANLI                    Label  ...         NaN         NaN         NaN
3         NaN            1                      NaN  ...         NaN         7.6         NaN
4         NaN            2                      NaN  ...         NaN        92.4         NaN

[5 rows x 21 columns]

When I tried the stream=True option, it mostly ignored the first 7 lines.

>>> dfs = tabula.read_pdf("HINTS 5 Cycle 4 Public Codebook.pdf", pages=27, stream=True)
>>> dfs[0]
   Unnamed: 0 Unnamed: 1  Unnamed: 2  Unnamed: 3  Cumulative     Weighted   Unnamed: 4
0         NaN        NaN  Unweighted         NaN  Unweighted       Sample     Weighted
1         NaN      Value      Sample  Unweighted      Sample         Size      Percent
2  HIGHSPANLI      Label        Size     Percent        Size  (Estimated)  (Estimated)
3           1        Yes         347           9         347   19,266,519          7.6
4           2         No       3,518          91       3,865  234,548,678         92.4

Anyway, this is a tabula-java limitation, so I don't know of a way to avoid it other than setting the area option page by page.
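For anyone post-processing the stream=True result: the header still arrives split across the column names and the first three rows. Below is a minimal sketch of folding those fragments into one label per column. It assumes three header rows, as on this page; `merge_header_fragments` is a hypothetical helper written for illustration, not part of tabula-py, and the sample data is copied from the output above.

```python
def merge_header_fragments(column_names, header_rows):
    """Join each column name with the header-like cells below it.

    Hypothetical helper, not a tabula-py function. Placeholder names
    starting with "Unnamed" and empty cells (None or NaN) are skipped.
    """
    merged = []
    for i, name in enumerate(column_names):
        parts = [] if str(name).startswith("Unnamed") else [str(name)]
        for row in header_rows:
            cell = row[i]
            # NaN is the only value that is not equal to itself.
            if cell is not None and cell == cell:
                parts.append(str(cell))
        merged.append(" ".join(parts))
    return merged


NAN = float("nan")

# Column names and first three rows from the stream=True output in this thread.
COLUMNS = ["Unnamed: 0", "Unnamed: 1", "Unnamed: 2", "Unnamed: 3",
           "Cumulative", "Weighted", "Unnamed: 4"]
HEADER_ROWS = [
    [NAN, NAN, "Unweighted", NAN, "Unweighted", "Sample", "Weighted"],
    [NAN, "Value", "Sample", "Unweighted", "Sample", "Size", "Percent"],
    ["HIGHSPANLI", "Label", "Size", "Percent", "Size", "(Estimated)", "(Estimated)"],
]

labels = merge_header_fragments(COLUMNS, HEADER_ROWS)

# Possible use on a DataFrame `df` returned by tabula.read_pdf(..., stream=True),
# assuming the header occupies exactly the first three rows:
#   df.columns = merge_header_fragments(df.columns, df.iloc[:3].values.tolist())
#   df = df.iloc[3:].reset_index(drop=True)
```

With the sample data, the sixth label comes out as "Weighted Sample Size (Estimated)". Other pages may split the header over a different number of rows, so the count of three should be checked per table.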

chezou commented on July 17, 2024

Actually, I only tweaked the parameters at random, and I don't know why the stream option works well here. In any case, this is not a bug but tabula-java's behavior. Setting an explicit area would be the last resort.

Other than that, there is nothing more I can do to help, unfortunately.
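As a rough illustration of the "area page by page" fallback: tabula-py's read_pdf accepts an area argument given as top, left, bottom, right (in points by default). The sketch below keeps a per-page lookup of regions and turns one entry into read_pdf keyword arguments. The coordinates are placeholders that were not measured from the codebook, and `PAGE_AREAS`, `area_options`, and `read_page_region` are hypothetical names; combining area with stream=True is also just an assumption based on what worked above.

```python
# Hypothetical per-page regions as (top, left, bottom, right) in PDF points.
# These numbers are placeholders, not measured from the actual codebook.
PAGE_AREAS = {27: (300.0, 30.0, 520.0, 580.0)}


def area_options(page, areas=PAGE_AREAS):
    """Build read_pdf keyword arguments limited to one region of one page."""
    top, left, bottom, right = areas[page]
    if top >= bottom or left >= right:
        raise ValueError("area must satisfy top < bottom and left < right")
    return {"pages": page, "area": [top, left, bottom, right], "stream": True}


def read_page_region(pdf_path, page, areas=PAGE_AREAS):
    """Read only the configured region of a page (needs tabula-py and Java)."""
    import tabula  # imported lazily so the helper above works without it

    return tabula.read_pdf(pdf_path, **area_options(page, areas))
```

For example, `read_page_region("HINTS 5 Cycle 4 Public Codebook.pdf", 27)` would read only the placeholder region on page 27; the real coordinates have to be measured for each page layout.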
