krasjet / pdf.tocgen

A CLI toolset to generate table of contents for PDF files automatically.

Home Page: https://krasjet.com/voice/pdf.tocgen/

License: GNU General Public License v3.0

Makefile 1.06% Python 91.13% TeX 5.92% Shell 1.89%

pdf cli toc-generator scraping table-of-contents pdf-document pdf-files pymupdf

pdf.tocgen's Introduction

pdf.tocgen

                          in.pdf
                            |
                            |
     +----------------------+--------------------+
     |                      |                    |
     V                      V                    V
+----------+          +-----------+         +----------+
|          |  recipe  |           |   ToC   |          |
| pdfxmeta +--------->| pdftocgen +-------->| pdftocio +---> out.pdf
|          |          |           |         |          |
+----------+          +-----------+         +----------+

pdf.tocgen is a set of command-line tools for automatically extracting and generating the table of contents (ToC) of a PDF file. It uses the embedded font attributes and position of headings to deduce the basic outline of a PDF file.

It works best for PDF files produced from a TeX document using pdftex (and its friends pdflatex, pdfxetex, etc.), but it's designed to work with any software-generated PDF file (i.e., you shouldn't expect it to work with scanned PDFs). Examples include troff/groff, Adobe InDesign, Microsoft Word, and probably more.

Please see the homepage for a detailed introduction.

Installation

pdf.tocgen is written in Python 3. It is known to work with Python 3.7 to 3.11 on Linux, Windows, and macOS (On BSDs, you probably need to build PyMuPDF yourself). Use

$ pip install -U pdf.tocgen

to install the latest version systemwide. Alternatively, use pipx or

$ pip install -U --user pdf.tocgen

to install it for the current user. I would recommend the latter approach to avoid messing up the package manager on your system.

If you are using an Arch-based Linux distro, the package is also available on AUR. It can be installed using any AUR helper, for example yay:

$ yay -S pdf.tocgen

Workflow

The design of pdf.tocgen is influenced by the Unix philosophy. I intentionally split pdf.tocgen into three separate programs. They work together, but each of them is useful on its own.

pdfxmeta: extract the metadata (font attributes, positions) of headings to build a recipe file.
pdftocgen: generate a table of contents from the recipe.
pdftocio: import the table of contents to the PDF document.

You should read the example on the homepage for a proper introduction, but the basic workflow goes like this.

First, use pdfxmeta to search for the metadata of headings and generate heading filters using the automatic setting (-a):

$ pdfxmeta -p page -a 1 in.pdf "Section" >> recipe.toml
$ pdfxmeta -p page -a 2 in.pdf "Subsection" >> recipe.toml

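Note that page is a placeholder: replace it with the number of the page on which the search keyword appears (for example, -p 14). Passing the literal word page makes pdfxmeta fail with "ValueError: invalid literal for int() with base 10: 'page'".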

The output recipe.toml file will contain several heading filters, each of which specifies the attributes that a heading at a particular level should have.

An example recipe file would look like this:

[[heading]]
level = 1
greedy = true
font.name = "Times-Bold"
font.size = 19.92530059814453

[[heading]]
level = 2
greedy = true
font.name = "Times-Bold"
font.size = 11.9552001953125
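As an illustration of what such a filter expresses, here is a minimal Python sketch of how one filter could be compared against the metadata of a piece of text. It is not pdf.tocgen's actual matching code: the `matches` helper, the plain-dict layout, and the default tolerance of 1e-5 (borrowed from the commented `font.size_tolerance` line that pdfxmeta emits) are assumptions made for the example.

```python
def matches(span, heading_filter, default_tolerance=1e-5):
    """Return True if a text span satisfies every attribute in the filter.

    Hypothetical helper: both arguments are plain dicts keyed by the dotted
    names used in the recipe file (e.g. "font.name", "font.size").
    """
    tolerance = heading_filter.get("font.size_tolerance", default_tolerance)
    for key, expected in heading_filter.items():
        if key in ("level", "greedy", "font.size_tolerance"):
            continue  # these configure the filter; they are not span attributes
        actual = span.get(key)
        if key == "font.size":
            # Font sizes are floats, so compare within a tolerance.
            if actual is None or abs(actual - expected) > tolerance:
                return False
        elif actual != expected:
            return False
    return True


level1 = {"level": 1, "greedy": True,
          "font.name": "Times-Bold", "font.size": 19.92530059814453}
heading = {"font.name": "Times-Bold", "font.size": 19.925300598}
body = {"font.name": "Times-Roman", "font.size": 9.962599754333496}
print(matches(heading, level1), matches(body, level1))
```

Comparing sizes within a tolerance rather than exactly is the reason the extracted values can be pasted into a recipe with all their digits and still match.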

Then pass the recipe to pdftocgen to generate a table of contents,

$ pdftocgen in.pdf < recipe.toml
"Preface" 5
    "Bottom-up Design" 5
    "Plan of the Book" 7
    "Examples" 9
    "Acknowledgements" 9
"Contents" 11
"The Extensible Language" 14
    "1.1 Design by Evolution" 14
    "1.2 Programming Bottom-Up" 16
    "1.3 Extensible Software" 18
    "1.4 Extending Lisp" 19
    "1.5 Why Lisp (or When)" 21
"Functions" 22
    "2.1 Functions as Data" 22
    "2.2 Defining Functions" 23
    "2.3 Functional Arguments" 26
    "2.4 Functions as Properties" 28
    "2.5 Scope" 29
    "2.6 Closures" 30
    "2.7 Local Functions" 34
    "2.8 Tail-Recursion" 35
    "2.9 Compilation" 37
    "2.10 Functions from Lists" 40
"Functional Programming" 41
    "3.1 Functional Design" 41
    "3.2 Imperative Outside-In" 46
    "3.3 Functional Interfaces" 48
    "3.4 Interactive Programming" 50
[--snip--]

which can be directly imported to the PDF file using pdftocio,

$ pdftocgen in.pdf < recipe.toml | pdftocio -o out.pdf in.pdf

Or if you want to edit the table of contents before importing it,

$ pdftocgen in.pdf < recipe.toml > toc
$ vim toc # edit
$ pdftocio in.pdf < toc
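Since the toc file is plain text, it can also be edited or checked with a script instead of by hand. The sketch below parses one line of the format shown above: a quoted title, a page number, and an optional vertical position (as produced by the -v flag), with the nesting level given by indentation. The four-spaces-per-level rule is inferred from the examples, and the `TocLine` and `parse_toc_line` names are hypothetical; this is not the parser pdftocio itself uses.

```python
import re
from typing import NamedTuple, Optional


class TocLine(NamedTuple):
    level: int
    title: str
    page: int
    vpos: Optional[float]


# indent, "title", page, optional vertical position
_LINE = re.compile(
    r'^(?P<indent> *)"(?P<title>.*)" +(?P<page>\d+)'
    r'(?: +(?P<vpos>-?\d+(?:\.\d+)?))?\s*$'
)


def parse_toc_line(line, indent_width=4):
    """Parse one ToC line; assumes `indent_width` spaces per nesting level."""
    m = _LINE.match(line.rstrip("\n"))
    if m is None:
        raise ValueError(f"not a ToC entry: {line!r}")
    level = len(m["indent"]) // indent_width + 1
    vpos = float(m["vpos"]) if m["vpos"] else None
    return TocLine(level, m["title"], int(m["page"]), vpos)


print(parse_toc_line('    "1.1 Design by Evolution" 14'))
```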

Each of the three programs has some extra functionalities. Use the -h option to see all the options you could pass in.

Command examples

Because of the modularity of design, each program is useful on its own, despite being part of the pipeline. This section will provide some more examples on how you could use them. Feel free to come up with more.

`pdftocio`

pdftocio best demonstrates this point: the program can do a lot on its own.

To display the existing table of contents of a PDF on stdout:

$ pdftocio doc.pdf
"Level 1 heading 1" 1
    "Level 2 heading 1" 1
        "Level 3 heading 1" 2
        "Level 3 heading 2" 3
    "Level 2 heading 2" 4
"Level 1 heading 2" 5

To write the existing table of contents of a PDF to a file named toc:

$ pdftocio doc.pdf > toc

To write a toc file back to doc.pdf:

$ pdftocio doc.pdf < toc

To specify the name of output PDF:

$ pdftocio -o out.pdf doc.pdf < toc

To copy the table of contents from doc1.pdf to doc2.pdf:

$ pdftocio -v doc1.pdf | pdftocio doc2.pdf

Note that the -v flag helps preserve the vertical positions of headings during the copy.

To print the table of contents for reading:

$ pdftocio -H doc.pdf
Level 1 heading 1 ··· 1
    Level 2 heading 1 ··· 1
        Level 3 heading 1 ··· 2
        Level 3 heading 2 ··· 3
    Level 2 heading 2 ··· 4
Level 1 heading 2 ··· 5

`pdftocgen`

If you have obtained an existing recipe rcp.toml for doc.pdf, you could apply it and print the outline to stdout by

$ pdftocgen doc.pdf < rcp.toml
"Level 1 heading 1" 1
    "Level 2 heading 1" 1
        "Level 3 heading 1" 2
        "Level 3 heading 2" 3
    "Level 2 heading 2" 4
"Level 1 heading 2" 5

To output the table of contents to a file called toc:

$ pdftocgen doc.pdf < rcp.toml > toc

To import the generated table of contents to the PDF file, and output to doc_out.pdf:

$ pdftocgen doc.pdf < rcp.toml | pdftocio -o doc_out.pdf doc.pdf

To print the generated table of contents for reading:

$ pdftocgen -H doc.pdf < rcp.toml
Level 1 heading 1 ··· 1
    Level 2 heading 1 ··· 1
        Level 3 heading 1 ··· 2
        Level 3 heading 2 ··· 3
    Level 2 heading 2 ··· 4
Level 1 heading 2 ··· 5

If you want to include the vertical position in a page for each heading, use the -v flag

$ pdftocgen -v doc.pdf < rcp.toml
"Level 1 heading 1" 1 306.947998046875
    "Level 2 heading 1" 1 586.3488159179688
        "Level 3 heading 1" 2 586.5888061523438
        "Level 3 heading 2" 3 155.66879272460938
    "Level 2 heading 2" 4 435.8687744140625
"Level 1 heading 2" 5 380.78875732421875

pdftocio can understand the vertical position in the output to generate table of contents entries that link to the exact position of the heading, instead of the top of the page.

$ pdftocgen -v doc.pdf < rcp.toml | pdftocio doc.pdf

Note that the default output of pdftocio here is doc_out.pdf.

`pdfxmeta`

To search for Anaphoric in the entire PDF:

$ pdfxmeta onlisp.pdf "Anaphoric"
14. Anaphoric Macros:
    font.name = "Times-Bold"
    font.size = 9.962599754333496
    font.color = 0x000000
    font.superscript = false
    font.italic = false
    font.serif = true
    font.monospace = false
    font.bold = true
    bbox.left = 308.6400146484375
    bbox.top = 307.1490478515625
    bbox.right = 404.33282470703125
    bbox.bottom = 320.9472351074219
[--snip--]

To output the result as a heading filter with the automatic settings,

$ pdfxmeta -a 1 onlisp.pdf "Anaphoric"
[[heading]]
# 14. Anaphoric Macros
level = 1
greedy = true
font.name = "Times-Bold"
font.size = 9.962599754333496
# font.size_tolerance = 1e-5
# font.color = 0x000000
# font.superscript = false
# font.italic = false
# font.serif = true
# font.monospace = false
# font.bold = true
# bbox.left = 308.6400146484375
# bbox.top = 307.1490478515625
# bbox.right = 404.33282470703125
# bbox.bottom = 320.9472351074219
# bbox.tolerance = 1e-5
[--snip--]

which can be written directly to a recipe file:

$ pdfxmeta -a 1 onlisp.pdf "Anaphoric" >> recipe.toml

To search case-insensitively for Anaphoric in the entire PDF:

$ pdfxmeta -i onlisp.pdf "Anaphoric"
to compile-time. Chapter 14 introduces anaphoric macros, which allow you to:
    font.name = "Times-Roman"
    font.size = 9.962599754333496
    font.color = 0x000000
    font.superscript = false
    font.italic = false
    font.serif = true
    font.monospace = false
    font.bold = false
    bbox.left = 138.60000610351562
    bbox.top = 295.6583557128906
    bbox.right = 459.0260009765625
    bbox.bottom = 308.948486328125
[--snip--]

To get the same case-insensitive match for Anaphoric with a regular expression:

$ pdfxmeta onlisp.pdf "[Aa]naphoric"
to compile-time. Chapter 14 introduces anaphoric macros, which allow you to:
    font.name = "Times-Roman"
    font.size = 9.962599754333496
    font.color = 0x000000
    font.superscript = false
    font.italic = false
    font.serif = true
    font.monospace = false
    font.bold = false
    bbox.left = 138.60000610351562
    bbox.top = 295.6583557128906
    bbox.right = 459.0260009765625
    bbox.bottom = 308.948486328125
[--snip--]

To search only on page 203:

$ pdfxmeta -p 203 onlisp.pdf "anaphoric"
anaphoric if, called:
    font.name = "Times-Roman"
    font.size = 9.962599754333496
    font.color = 0x000000
    font.superscript = false
    font.italic = false
    font.serif = true
    font.monospace = false
    font.bold = false
    bbox.left = 138.60000610351562
    bbox.top = 283.17822265625
    bbox.right = 214.81094360351562
    bbox.bottom = 296.4683532714844
[--snip--]

To dump all of page 203:

$ pdfxmeta -p 203 onlisp.pdf
190:
    font.name = "Times-Roman"
    font.size = 9.962599754333496
    font.color = 0x000000
    font.superscript = false
    font.italic = false
    font.serif = true
    font.monospace = false
    font.bold = false
    bbox.left = 138.60000610351562
    bbox.top = 126.09941101074219
    bbox.right = 153.54388427734375
    bbox.bottom = 139.38951110839844
[--snip--]

To dump the entire PDF document:

$ pdfxmeta onlisp.pdf
i:
    font.name = "Times-Roman"
    font.size = 9.962599754333496
    font.color = 0x000000
    font.superscript = false
    font.italic = false
    font.serif = true
    font.monospace = false
    font.bold = false
    bbox.left = 458.0400085449219
    bbox.top = 126.09941101074219
    bbox.right = 460.8096008300781
    bbox.bottom = 139.38951110839844
[--snip--]

Development

If you want to modify the source code or contribute anything, first install poetry, which is a dependency and package manager for Python used by pdf.tocgen. Then run

$ poetry install

in the root directory of this repository to set up development dependencies.

If you want to test the development version of pdf.tocgen, use the poetry run command:

$ poetry run pdfxmeta in.pdf "pattern"

Alternatively, you could also use the

$ poetry shell

command to open up a virtual environment and run the development version directly:

(pdf.tocgen) $ pdfxmeta in.pdf "pattern"

Before you send a patch or pull request, make sure the unit tests pass by running:

$ make test

GUI front end

If you are an Emacs user, you could install Daniel Nicolai's toc-mode package as a GUI front end for pdf.tocgen. It also offers more functionality, such as extracting the (printed) table of contents from a PDF file. Note that it uses pdf.tocgen under the hood, so you still need to install pdf.tocgen before using toc-mode as a front end.

License

pdf.tocgen itself is free software. The source code of pdf.tocgen is licensed under the GNU GPLv3 license. However, the recipes in the recipes directory are separately licensed under the CC BY-NC-SA 4.0 License to prevent any commercial usage, and are thus not included in the distribution.

pdf.tocgen is based on PyMuPDF, licensed under the GNU GPLv3 license, which is again based on MuPDF, licensed under the GNU AGPLv3 license. A copy of the AGPLv3 license is included in the repository.

If you want to make any derivatives based on this project, please follow the terms of the GNU GPLv3 license.

pdf.tocgen's Issues

no table of contents found

wk6.pdf
It is converted from a Google Doc. Is there an easy workaround for this type of document,
or do I have to create the ToC manually?

pdftocio throws error when saving

Hi, I'm trying to use pdftocio to generate a ToC for a PDF. I have already generated the toc file using pdfxmeta and pdftocgen. The error is as follows:

Traceback (most recent call last):
  File "/home/amzon-ex/.local/bin/pdftocio", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/amzon-ex/.local/lib/python3.11/site-packages/pdftocio/app.py", line 163, in main
    doc.save(out)
  File "/home/amzon-ex/.local/lib/python3.11/site-packages/fitz/fitz.py", line 4710, in save
    return _fitz.Document_save(self, filename, garbage, clean, deflate, deflate_images, deflate_fonts, incremental, ascii, expand, linear, no_new_id, appearance, pretty, encryption, permissions, owner_pw, user_pw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: cannot find object in xref (2818 0 R)

Can you help?

multiline headings can't be merged

The expected output for the configuration below is a single heading, "Chapter 1: Introduction", but it generates two separate headings pointing to the same page.

[[heading]]
# Chapter 1
level = 1
greedy = true
font.name = "ComputerModernBoldExtend"
font.size = 24.875999450683594

[[heading]]
# Introduction
level = 1
greedy = true
font.name = "ComputerModernBoldExtend"
font.size = 29.844999313354492

It should merge the headings at the same level, right? Or maybe we should allow specifying a regex pattern for headings.

"KeyError: 'to'" when extracting ToC from some PDFs with pdftocio

Synopsis

When using pdftocio to extract a table of contents, it sometimes crashes with the error "KeyError: 'to'".

Steps to Reproduce

Create a PDF (named e.g. "toc.pdf") with an outline item (AKA bookmark) that fits the entire page to the view (I'm not certain of the terminology here; the PDF standard doesn't seem to define any special term for this; the Destination syntax has names, such as /Fit and /XYZ, that specify how the page is to be displayed but doesn't define a term for these names or the behavior they specify).
Use pdftocio to extract the ToC from the PDF, (e.g. run pdftocio toc.pdf)

Expected Result

A table of contents (in the format used by pdf.tocgen) should be output.

Actual Result

pdftocio outputs a stack trace and an error:

 Traceback (most recent call last):
  File "/usr/local/bin/pdftocio", line 8, in <module>
    sys.exit(main())
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pdftocio/app.py", line 142, in main
    toc = read_toc(doc)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pdftocio/tocio.py", line 16, in read_toc
    return [
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pdftocio/tocio.py", line 17, in <listcomp>
    ToCEntry(e[0], e[1], e[2], e[3]['to'].y) if len(e) == 4 else
KeyError: 'to'

Additional Info

A minimal sample PDF was created containing a single bookmark with the view fit set to "Page" (/Fit). This causes pdftocio to generate the above error. Printing the value of e that causes the exception gives:

[1, 'Page', 1, {'kind': 4, 'xref': 2, 'name': 'page=1&view=Fit', 'zoom': 0.0}]

For comparison, this is the entry for a bookmark created by pdftocio (a fit of /XYZ, fit top-left and zoom):

[[1, 'Page', 1, {'kind': 1, 'xref': 2, 'page': 0, 'to': Point(0.0, 0.0), 'zoom': 0.7222222089767456}]]

Other fits also cause this error: /FitH (fit width), /FitV (fit height), '/FitR' (fit rectangle). I wasn't able to test the bounding box fits, /FitB, /FitBH and /FitBV.

Looking at the function containing the line where the error is generated (read_toc), it tests whether e has 4 entries (to avoid an exception), but it assumes that if the 4th entry is present, it will be an object containing a "to" key. However, fitz (which provides the method, fitz.Document.get_toc, used to generate e) doesn't seem to create this key except when the fit is /XYZ. This last might be a bug within fitz, but it still can and should be addressed in pdftocio. Is expanding the test to check whether the key is present (len(e) == 4 and "to" in e) sufficient, or will there be some other problem if e[3] exists but doesn't contain a "to" item?
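A minimal sketch of such a guard is shown below. One detail worth noting is that the check has to be made against the dictionary `e[3]`, not the list `e` itself. The `ToCEntry` tuple here is a stand-in with assumed field names, `to_entry` is a hypothetical helper, and `SimpleNamespace` merely imitates the `Point` object from the report; none of this is taken from the pdftocio source.

```python
from types import SimpleNamespace
from typing import NamedTuple, Optional


class ToCEntry(NamedTuple):
    # Stand-in for pdf.tocgen's entry type; the field names are assumptions.
    level: int
    title: str
    pagenum: int
    vpos: Optional[float] = None


def to_entry(e):
    """Convert one outline item without assuming a 'to' key is present."""
    dest = e[3] if len(e) == 4 else {}
    point = dest.get("to")            # missing for /Fit, /FitH, /FitV, ...
    vpos = getattr(point, "y", None)  # fall back to "no vertical position"
    return ToCEntry(e[0], e[1], e[2], vpos)


# The two items quoted in the report above.
fit_item = [1, 'Page', 1, {'kind': 4, 'xref': 2,
                           'name': 'page=1&view=Fit', 'zoom': 0.0}]
xyz_item = [1, 'Page', 1, {'kind': 1, 'xref': 2, 'page': 0,
                           'to': SimpleNamespace(x=0.0, y=0.0),
                           'zoom': 0.7222222089767456}]
print(to_entry(fit_item))
print(to_entry(xyz_item))
```

With this fallback, an entry whose destination has no "to" point is simply treated like an entry without a vertical position, so it would link to the page rather than to a position within it.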

Long term, supporting different fits might be the best solution (and a nice new feature), as this would both allow preserving them for an existing ToC and allow users to set them.

Vertical position seems to be incorrect for a file

Sample PDF: https://www.math.ias.edu/~lurie/papers/HTT.pdf

If I run pdftocgen -v with the TOML file

level = 1
greedy = true
font.name = "CMBX10"
font.size = 9.962599754333496

I get a TOC with one line being

"1.1 FOUNDATIONS FOR HIGHER CATEGORY THEORY" 19 484.73089599609375

and if I use pdftocio to write the TOC into the PDF, and open the result PDF by PDF viewers (I tested zathura and okular), the vertical position seems to be incorrect.

pdf.tocgen v1.3.3
Python 3.11.2

Combining characters

It seems pdftocgen does not support unicode combining characters (like most diacritical marks), and since they are very common in most non-English languages I think it would be nice to support them. Other tools like pdftotext and evince can handle such characters correctly, so I think this is not an issue with the PDF files themselves.

Example: t1.pdf
pdftotext t1.pdf prints Matemática, but pdfxmeta t1.pdf prints:

Matem´atica:
    font.name = "CMBX12"
    font.size = 14.346199989318848
    font.color = 0x000000
    [...]
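Until this is handled by the tool, the generated toc could be post-processed. The sketch below assumes that the accent is extracted as a spacing character placed before its base letter, as in the output above (here taken to be U+00B4, the acute accent). It moves the accent behind the letter as a combining mark and lets NFC normalization compose the two. The mapping table and the `fix_accents` name are illustrative and cover only a few accents.

```python
import re
import unicodedata

# Spacing accents that may appear before the base letter in extracted text,
# mapped to the corresponding Unicode combining marks (illustrative subset).
_SPACING_TO_COMBINING = {
    "\u00b4": "\u0301",  # acute
    "\u00a8": "\u0308",  # diaeresis
    "\u02c6": "\u0302",  # circumflex
    "\u02dc": "\u0303",  # tilde
}

# A spacing accent immediately followed by a letter.
_PATTERN = re.compile(
    "([" + "".join(map(re.escape, _SPACING_TO_COMBINING)) + "])([^\\W\\d_])"
)


def fix_accents(text):
    """Turn 'accent + letter' into a single precomposed character."""
    swapped = _PATTERN.sub(
        lambda m: m.group(2) + _SPACING_TO_COMBINING[m.group(1)], text
    )
    return unicodedata.normalize("NFC", swapped)


print(fix_accents("Matem\u00b4atica"))
```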

[feature request] All caps headings

First of all, thanks for the project! It is super cool!
I think all-caps would be a great filter for some headings. Lots of books use all-caps for subheadings.

Installation using pip3 not working

I'm probably doing something wrong, because I can see the project homepage at PyPI, but pip3 install does not work for me:

$ pip3 --version
pip 21.3.1 from /home/xxx/.local/lib/python3.6/site-packages/pip (python 3.6)
$ pip3 install pdf.tocgen
Defaulting to user installation because normal site-packages is not writeable
ERROR: Could not find a version that satisfies the requirement pdf.tocgen (from versions: none)
ERROR: No matching distribution found for pdf.tocgen

Any pointers would be very helpful. Thanks very much!

error: invalid literal for int() with base 10: ''

Thanks for the awesome project!

While this works very well after some tweaking of the toc for most documents, I am having trouble re-writing the toc for a fairly large pdf (no starch press's Python Crash Course)

I get the following error when running pdftocio pythoncrashcourse_updated.pdf < toc

error: invalid literal for int() with base 10: ''

This seems to indicate that a number somewhere (possibly a page number) is being read as an empty string. I've gone through the document quite carefully but can't seem to find the issue. The error message is also not particularly verbose.
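One way to narrow this down is to check the toc file line by line before handing it to pdftocio. The sketch below flags every non-empty line that is not exactly an optionally indented quoted title, one space, an integer page number, and an optional vertical position. What actually triggers the error inside pdftocio is not established here, so the pattern is deliberately stricter than may be necessary (it also flags doubled or trailing spaces); `find_bad_lines` is a hypothetical helper.

```python
import re

# indent, "title", single space, page, optional " vpos", nothing else
_ENTRY = re.compile(r'^ *"[^"]*" \d+(?: -?\d+(?:\.\d+)?)?$')


def find_bad_lines(toc_text):
    """Return (line_number, line) pairs that do not look like ToC entries."""
    bad = []
    for number, line in enumerate(toc_text.splitlines(), start=1):
        if line.strip() and not _ENTRY.match(line):
            bad.append((number, line))
    return bad


sample = '"Preface" 5\n    "Plan" 7 12.5\n"Broken" \n"Also broken"  9\nNoQuotes 3\n'
for number, line in find_bad_lines(sample):
    print(f"line {number}: {line!r}")
```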

[feature request] Make this into one script

This is a lovely tool and works perfectly. It would be even easier to use if all three programs could be linked together in one script.

This script might run with arguments:

filename
[(page, string)] to search the xmeta.

and with interactive:

confirm the generated toc
open the generated pdf file to further check.
maybe also confirm the output file path.

Not working

I just installed pdf.tocgen without any errors.

When running I got this error messages:

C:\Users\fn31>pdfxmeta -p page -a 1 ABasicStudy.pdf "Section" >> recipe.toml

Traceback (most recent call last):
  File "C:\Users\fn31\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\fn31\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\fn31\AppData\Local\Programs\Python\Python310\Scripts\pdfxmeta.exe\__main__.py", line 7, in <module>
  File "C:\Users\fn31\AppData\Local\Programs\Python\Python310\lib\site-packages\pdfxmeta\app.py", line 92, in main
    page = int(a)
ValueError: invalid literal for int() with base 10: 'page'

ABasicStudy.pdf

ValueError: invalid literal for int() with base 10: 'page'

Hi, Thank you for this great work!

I'm looking to detect and extract pdf TOC, but all my tests (with short or long and complex pdf) raise the following error: ValueError: invalid literal for int() with base 10: 'page'.

For example with this one: https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/text-extraction/Dart.pdf or this one (not a LaTex file).

(pdftocgen) user@MacBook-Pro-de-user Desktop % pdfxmeta -p page -a 1 Dart.pdf "Section" >> recipe.toml   
Traceback (most recent call last):
  File "/Users/user/miniconda3/envs/pdftocgen/bin/pdfxmeta", line 8, in <module>
    sys.exit(main())
  File "/Users/user/miniconda3/envs/pdftocgen/lib/python3.9/site-packages/pdfxmeta/app.py", line 91, in main
    page = int(a)
ValueError: invalid literal for int() with base 10: 'page'

I am on macOS and the installation of pdf.tocgen with the command pip install -U pdf.tocgen worked well (I detailed the way here).

Do you know why this error is raised? Thanks again.

pdftocio — support for vertical positions for outputs

Given that pdftocio supports importing the vertical positions of ToC, it seems also desirable to support exporting the vertical positions of ToC of a PDF.

Identify headings by alignment

Hi,
is there a way to identify headings by paragraph alignment? In some documents, the font is always the same (both for name and size) and the bold weight is used both in paragraphs and headings. One way to distinguish between paragraphs and headings could be the alignment.

Based on the documentation, the only position information we have is the bounding box. Is there a way to get all boxes whose left margin falls within a range of values, for instance? Or boxes whose left and right margins are equal? Or to select all bold text starting at the left margin (in case the headings are left-aligned)?

Thank you for your excellent work!

Print page's text when pdfxmeta command fails

I have a document for which I am trying to define the levels of headings in the pdf using the pdfxmeta command.
However, as you can see from the example page here, using mutool draw -F text out.pdf 1 from the command line:
out.pdf
for some reason the text extracted from that page by mupdf reads as follows:

'Chapter\n1\nIn\ntro\nduction\n... etc.'

i.e., the text is broken up with newline characters. When I give a complete word to the pdfxmeta command, it does not find anything and fails to set a level. However, when I provide just the part of a word that lies between two newline characters (e.g. 'duction'), the pdftocgen command subsequently works just fine. Of course I only got here by debugging, but it would be great if the pdfxmeta command printed the page's text when it fails to find the given pattern (or, alternatively, if the documentation mentioned the mutool draw -F text ... command).

Thanks for the beautiful package!

P.S. I have integrated the functionality of pdf.tocgen into Emacs's toc-mode. You might like toc-mode's other features too. I am not sure if you are using Emacs, but if you are using vim you might like to try out Spacemacs (check out the develop branch immediately after downloading).

ValueError: could not convert string to float:

Hi @Krasjet, I'm having another issue, this time with the tocparser. When attempting to generate the ToC with pdftocio -g higginbotham2015.pdf < toc I get the following error:

Traceback (most recent call last):
  File "/usr/bin/pdftocio", line 33, in <module>
    sys.exit(load_entry_point('pdf.tocgen==1.2.3', 'console_scripts', 'pdftocio')())
  File "/usr/lib/python3.9/site-packages/pdftocio/app.py", line 156, in main
    raise e
  File "/usr/lib/python3.9/site-packages/pdftocio/app.py", line 146, in main
    toc = parse_toc(toc_file)
  File "/usr/lib/python3.9/site-packages/pdftocio/tocparser.py", line 38, in parse_toc
    return list(map(parse_entry, reader))
ValueError: could not convert string to float: 'Brief'

Obviously the error is pointing to the first word of the first line. I checked that I am using unix formatting by running dos2unix just in case something was hidden in the text file, but it didn't help.

toc.txt

How to use python or node child process to call pdftocio?

It seems I cannot use Python or Node.js to call the shell and execute pdftocio. Please help.
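The three tools are ordinary command-line programs, so from Python they can be started as child processes. A minimal sketch using the standard `subprocess` module follows. It assumes that `pdftocio` is on the PATH of the Python process and relies only on the invocations documented above (`pdftocio doc.pdf` and `pdftocio -o out.pdf doc.pdf < toc`); the function names are hypothetical.

```python
import subprocess


def pdftocio_command(pdf_path, out_path=None):
    """Build the argument list for pdftocio (no shell involved)."""
    cmd = ["pdftocio"]
    if out_path is not None:
        cmd += ["-o", out_path]
    cmd.append(pdf_path)
    return cmd


def read_toc(pdf_path):
    """Run `pdftocio doc.pdf` and return the ToC it prints to stdout."""
    result = subprocess.run(pdftocio_command(pdf_path),
                            capture_output=True, text=True, check=True)
    return result.stdout


def write_toc(pdf_path, toc_text, out_path):
    """Run `pdftocio -o out.pdf doc.pdf`, feeding the ToC through stdin."""
    subprocess.run(pdftocio_command(pdf_path, out_path),
                   input=toc_text, text=True, check=True)


print(pdftocio_command("doc.pdf", "out.pdf"))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces, and `check=True` turns a non-zero exit status into a Python exception.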

Headings of the same level getting merged into 1 line

For example:

"Part A: The Statement of Financial Position" 8
    "Usefulness Limitations Classification of Elements" 8
        "Assets Liabilities Shareholders’ Equity Concept Review Exercise: Statement of Financial Position Classification" 8

when it should be

Part A: The Statement of Financial Position
        Usefulness
        Limitations
        Classification of Elements
                Assets
                Liabilities
                Shareholders’ Equity
                Concept Review Exercise: Statement of Financial Position Classification

The headings that get merged have identical metadata, with the only exception being the bbox.top and bbox.bottom values. Any idea how I should classify them?

The original contents page in the file also does not show any page numbers, which might have caused this problem too.

[feature request] add `regex pattern` filter

I have this pdf file: https://docs.ton.org/ton.pdf
I used following recipe to create a toc:

[[heading]]
# TON Blockchain
level = 1
greedy = true
font.name = "F102"
font.size = 17.21540069580078
# font.size_tolerance = 1e-5
# font.color = 0x000000
# font.superscript = false
# font.italic = false
# font.serif = false
# font.monospace = false
# font.bold = false
# bbox.left = 138.70851135253906
# bbox.top = 127.66803741455078
# bbox.right = 274.1837158203125
# bbox.bottom = 144.88343811035156
# bbox.tolerance = 1e-5
[[heading]]
# TON Blockchain as a Collection of 2-Blockchains
level = 2
greedy = true
font.name = "F108"
font.size = 14.346199989318848
# font.size_tolerance = 1e-5
# font.color = 0x000000
# font.superscript = false
# font.italic = false
# font.serif = false
# font.monospace = false
# font.bold = false
# bbox.left = 146.76255798339844
# bbox.top = 291.47509765625
# bbox.right = 486.075927734375
# bbox.bottom = 305.8212890625
# bbox.tolerance = 1e-5
[[heading]]
# 2.1.1. List of blockchain types.
level = 3
greedy = false
font.name = "F104"
font.size = 11.9552001953125
# font.size_tolerance = 1e-5
# font.color = 0x000000
# font.superscript = false
# font.italic = false
# font.serif = false
# font.monospace = false
# font.bold = false
# bbox.left = 110.85400390625
# bbox.top = 395.5226745605469
# bbox.right = 289.56573486328125
# bbox.bottom = 407.52569580078125
# bbox.tolerance = 1e-5

The problem is that level 3 would contain many wrong outputs, for example:

"1 Brief Description of TON Components" 3
        "2 2.1.17 2.4.20" 3
        "3" 3
        "4.1.7" 3
        "4.1.10 3.1.6" 3
        "3.2 3.2.10 3.2.14 3.2.12" 3
        "4 4.3.14 4.3.17 3.2.12 4.1.6" 4
        "4.3.1" 4
        "5" 4
        "4.3.23" 4
        "2.9.13 4.1" 4
"2 TON Blockchain" 5
    "2.1 TON Blockchain as a Collection of 2-Blockchains" 5
        "2.1.17" 5
        "2.1.1. List of blockchain types." 5
        "2.8.8 2.9.7 2.9.8" 5
        "2.8.12 2.8.8" 6
        "2.1.17" 6
        "2.1.2. Innite Sharding Paradigm." 6
        "2.1.3. Messages. Instant Hypercube Routing. 2.4.2 2.4.20" 7
        "2.1.4. Quantity of masterchains, workchains and shardchains." 7

The correct ones all share the same pattern: "\d+\.\d+\.\d+\.". Currently I can delete the wrong level 3 lines in vim using this command:

:'<,'>g!/"\d\+\.\d\+\.\d\+\./d

But it's better to have a regex pattern matching filter. The filter should be able to:

exclude an output that doesn't match a regex
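Until such a filter exists, the same clean-up can be scripted outside of vim. The sketch below mirrors the vim command above, but only touches lines at the level 3 indentation (eight spaces in the output shown), so level 1 and level 2 headings are always kept. `drop_unmatched_level3` is a hypothetical helper operating on the text that pdftocgen prints; it is not a pdf.tocgen feature.

```python
import re

# Same pattern as in the vim command: a title starting with "N.N.N."
LEVEL3 = re.compile(r'^\s*"\d+\.\d+\.\d+\.')


def drop_unmatched_level3(lines, indent=8):
    """Remove level 3 lines whose title does not start with N.N.N."""
    kept = []
    for line in lines:
        depth = len(line) - len(line.lstrip(" "))
        if depth == indent and not LEVEL3.match(line):
            continue  # a level 3 line that is not a real heading
        kept.append(line)
    return kept


toc = [
    '"2 TON Blockchain" 5',
    '    "2.1 TON Blockchain as a Collection of 2-Blockchains" 5',
    '        "2.1.17" 5',
    '        "2.1.1. List of blockchain types." 5',
    '        "2.8.8 2.9.7 2.9.8" 5',
]
print("\n".join(drop_unmatched_level3(toc)))
```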

Save output PDF as original PDF File

First, this tool is awesome! I am starting to like Google Docs as it facilitates collaboration, but I hate that Google Docs doesn't translate the TOC when saving it as a PDF, and making the doc --> docx --> pdf messes with the format.

Anyway: I would like to automatically save the output file (with the TOC) as the original file. That is, I do not want to have two files (keeping the original without the TOC).

I tried:
pdftocio -o original.pdf original.pdf < toc
but I get an error: save to original must be incremental

Is there an easy way to automatically save the ToC in the original file? Thanks!
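One workaround that needs no change to pdftocio is to let it write to a temporary file and then move that file over the original. The sketch below does this from Python. It assumes `pdftocio` is on the PATH and uses only the documented `-o` option; `add_toc_in_place` is a hypothetical helper, and its `run` parameter exists only so that the command runner can be swapped out.

```python
import os
import subprocess
import tempfile


def add_toc_in_place(pdf_path, toc_path, run=subprocess.run):
    """Write the ToC into a temporary copy, then replace the original."""
    directory = os.path.dirname(os.path.abspath(pdf_path))
    # Keep the temporary file on the same filesystem as the original,
    # so that os.replace can move it over in a single step.
    with tempfile.TemporaryDirectory(dir=directory) as tmp_dir:
        tmp_path = os.path.join(tmp_dir, "out.pdf")
        with open(toc_path, "rb") as toc:
            run(["pdftocio", "-o", tmp_path, pdf_path], stdin=toc, check=True)
        os.replace(tmp_path, pdf_path)
```

Calling `add_toc_in_place("original.pdf", "toc")` would then leave a single file behind. If pdftocio fails, the exception propagates before the replace step, so the original is left untouched.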

Error with pdftocio

I get an error when trying to insert the toc generated by pdftocgen into the pdf with pdftocio... I wonder if it has to do with your warning "don't expect it to work with scanned pdfs" (although the toc generation worked perfectly) or if it is something else... Below is the traceback.

Traceback (most recent call last):
  File "/usr/bin/pdftocio", line 33, in <module>
    sys.exit(load_entry_point('pdf.tocgen==1.2.2', 'console_scripts', 'pdftocio')())
  File "/usr/lib/python3.9/site-packages/pdftocio/app.py", line 147, in main
    write_toc(doc, toc)
  File "/usr/lib/python3.9/site-packages/pdftocio/tocio.py", line 11, in write_toc
    doc.setToC(fitz_toc)
  File "/usr/lib/python3.9/site-packages/fitz/utils.py", line 1107, in setToC
    doc._updateObject(xref[i], txt) # insert the PDF object
AttributeError: 'Document' object has no attribute '_updateObject'
zsh: exit 1 pdftocio -t toc book.pdf

Couldn't make it run in Python, please help

Hi,

I'm using Python3.8, and here is my code:

import pdfxmeta
import pdftocgen
import pdftocio

pdfxmeta -p 14 onlisp.pdf "The Extensible"

Error message:

"C:\Users\Yu Wang\Python3.8\python.exe" "C:/Users/Yu Wang/PycharmProjects/pythonProject/main.py"
  File "C:/Users/Yu Wang/PycharmProjects/pythonProject/main.py", line 6
    pdfxmeta -p 14 onlisp.pdf "The Extensible"
                ^
SyntaxError: invalid syntax

Here is my current directory:

os.getcwd()
'C:\\Users\\Yu Wang\\PycharmProjects\\pythonProject'

My file path is "C:\Users\Yu Wang\PycharmProjects\pythonProject\onlisp.pdf"

Please help!

Thanks so much!

Best Regards,
Yu Wang

[Feature request] Create TOC from the input's outlines (bookmarks)

That's me giving up. I liked very much your tools, but I couldn't figure out a way to submit a PR for this feature.

Fitz and Fitzutils both seem to have tools that would trivialize such a feature. A flag like -b (for bookmarks) and invoking and processing fitz.Document.get_toc(False) probably could do the trick, except that going back and forth from fitz.py to fitzutils.py to pdfexmeta.py and app.py confused me to a point I couldn't even print this single line!

Well, I'm just a beginner, but maybe you'll have some spare time for implementing this.

Thank you!

pdfxmeta issue

Deprecation: 'getTextPage' removed from class 'Page' after v1.19.0 - use 'get_textpage'.

ToC Positioning Accuracy

First, thanks for your project! It is very useful! But I have a small question about the positioning accuracy of the ToC.

Generated entries seem to point to a page, not to a position within a page. As a result, if a heading is at the end of a page, clicking its ToC entry does not jump to the heading; I have to keep scrolling down to find it. And if several headings are on the same page, clicking their ToC entries doesn't change the view at all.

I also found that bookmarks added manually in a PDF reader don't have this problem.

Hoping for your reply. Thank you!

Error when trying to run

I just installed the script, and without any errors.

When running the script, I get this error:

C:\Users\fn31>pdfxmeta -p page -a 1 C:\Users\fn31\OneDrive\Desktop\in.pdf "Section" >>recipe.tom
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\fn31\AppData\Local\Programs\Python\Python312\Scripts\pdfxmeta.exe\__main__.py", line 7, in <module>
  File "C:\Users\fn31\AppData\Local\Programs\Python\Python312\Lib\site-packages\pdfxmeta\app.py", line 92, in main
    page = int(a)
           ^^^^^^
ValueError: invalid literal for int() with base 10: 'page'

C:\Users\fn31>
in.pdf

default vertical position of bookmarks is not 0

I want to create a toc for a scanned document, which currently looks kinda like

"§24 Partielle Ableitungen" 28
    "24.1 Satz von Schwarz" 30
"§25 Differenzierbarkeit" 33

Using that, the bookmarks link to some small offset in the pdf (roughly as if 35 was given as offset, i.e.

"§24 Partielle Ableitungen" 28 35
    "24.1 Satz von Schwarz" 30 35
"§25 Differenzierbarkeit" 33 35

) and not directly to the top of the page, which is what I would expect them to link to. No offset should mean

"§24 Partielle Ableitungen" 28 0
    "24.1 Satz von Schwarz" 30 0
"§25 Differenzierbarkeit" 33 0

I use version 1.3.4.

pdfxmeta doesn't warn on nonexistent input file

Although the program exits with a status code of 1, it would be helpful to print a warning to stderr that the specified file doesn't exist.

How did you build the homepage?

I mean : https://krasjet.com/voice/pdf.tocgen/

It's very cool and looks like you built it automatically from some kind of source file like .ipynb.

thanks a lot

[feature request] print bookmarks down to level N only

Hi!

I have a document containing hundreds of pages, and for a quick overview I would like to run something like "pdftocio -H --level=3 foo.pdf" to get only levels 1 to 3 reported back.
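In the meantime, the depth can be limited by filtering the text that pdftocio prints, since the nesting level is encoded in the indentation. A minimal sketch, assuming four spaces per level as in the README examples; `limit_depth` is a hypothetical helper, and the `--level` option mentioned above does not exist in the options documented in the README.

```python
def limit_depth(toc_lines, max_level, indent=4):
    """Keep only the lines whose nesting level is at most `max_level`."""
    kept = []
    for line in toc_lines:
        depth = (len(line) - len(line.lstrip(" "))) // indent + 1
        if depth <= max_level:
            kept.append(line)
    return kept


toc = [
    '"Level 1 heading 1" 1',
    '    "Level 2 heading 1" 1',
    '        "Level 3 heading 1" 2',
    '        "Level 3 heading 2" 3',
    '    "Level 2 heading 2" 4',
    '"Level 1 heading 2" 5',
]
print("\n".join(limit_depth(toc, 2)))
```

The same function works on the human-readable output of the -H flag, because that format uses the same indentation.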