tocPDF

by Amin Yahyaabadi

Generates PDF bookmarks from the table of contents that already appears at the beginning of a PDF file.

The plan is to automate the whole procedure (https://github.com/aminya/tocPDF#automated).

Until then, here is the manual procedure:

Manual:

Step 1: Extraction of toc pages from PDF:

Use Chrome or software that you already have to extract the pages that contain the table of contents.

Tutorial for extracting pages using Chrome: https://www.techadvisor.co.uk/how-to/software/how-extract-pages-from-pdf-3679232/

We refer to this file as tocPDF.

Step 2: Extract table of contents text

Here we extract the text from tocPDF.

Even if your PDF file is searchable, text copied directly from it usually does not keep a proper format (such as a table).

Preferred Methods:

  • Using Tabula

For searchable PDFs only - https://tabula.technology/

* Download and run the software
* Select the table of contents; do this for each page
* Hit preview and export the extracted data
* Export to CSV format

  • Using OCR.space

For both scanned and searchable PDFs - https://ocr.space/

* Upload tocPDF
* Check the "Do receipt scanning and/or table recognition" option
* Use the "Just extract text and show overlay (fastest option)" option
* Download or copy-paste the generated text

We refer to the generated text as tocText.

Alternatively, instead of the following steps, you can use this tool, which does a similar job but with a GUI:

https://github.com/ifnoelse/pdf-bookmark/blob/master/README-EN.md

Step 3: Preparing the text of the table of contents

Open tocText (txt or csv) with a spreadsheet editor (MS Excel or Google Sheets) or a text editor.

Edit the text so that each line starts with the page number, followed by the title. Sub-levels are marked with one + per level in front of the page number, e.g.

1 Cover
2 Table of Contents
5 Chapter 1
+6 Subchapter1
++7 Sub-Subchapter1
25 Chapter 2
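Extracted TOC lines usually come out as "title, dot leaders, page number". The following is a minimal sketch, not part of tocPDF, of how such lines could be rearranged with a regular expression so the page number comes first. The function name `normalize_toc_line` and the exact pattern are assumptions; adjust them to your own OCR output.

```python
import re

# Hypothetical pattern: a title, then any run of spaces, dots and commas
# (the "dot leaders"), then the page number at the end of the line.
TRAILING_PAGE = re.compile(r"^(?P<title>.+?)[\s.,]*(?P<page>\d+)[\s,]*$")


def normalize_toc_line(line):
    """Return 'PAGE TITLE' for a line that ends in a page number, else None."""
    match = TRAILING_PAGE.match(line.strip())
    if match is None:
        return None  # no trailing page number: leave for manual review
    return f"{int(match['page'])} {match['title'].strip()}"


raw_lines = [
    "Squares and square roots .................,2",
    "Multiples .......................................,4",
    "Length, mass, capacity",
]
for raw in raw_lines:
    print(normalize_toc_line(raw))
```

Lines without a trailing number return `None` so they can be fixed by hand. A title that itself ends in a number (for example "Section 12" with no page) would be misread, so a manual review of the result is still needed.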

Don't forget to add the offset to each page number (the page numbers of a PDF file usually have an offset relative to the numbers printed in the document).
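Adding the offset by hand is error-prone for a long table of contents. As a rough sketch, assuming the file uses the "page title" layout with optional leading + signs shown above, the offset could be applied like this (`apply_offset` is a hypothetical helper, not part of k2pdfopt):

```python
import re

# Optional '+' signs for the level, then the page number, then the title.
TOC_LINE = re.compile(r"^(?P<level>\+*)(?P<page>\d+)\s+(?P<title>.*)$")


def apply_offset(lines, offset):
    """Add `offset` to the page number of every TOC line; keep other lines as-is."""
    shifted = []
    for line in lines:
        match = TOC_LINE.match(line)
        if match is None:
            shifted.append(line)  # not a TOC entry: leave untouched
            continue
        page = int(match["page"]) + offset
        shifted.append(f"{match['level']}{page} {match['title']}")
    return shifted


# Example: the printed page 5 is the 17th page of the PDF, so the offset is 12.
printed = ["5 Chapter 1", "+6 Subchapter1", "++7 Sub-Subchapter1"]
print("\n".join(apply_offset(printed, 12)))
```

The offset is the PDF page index of a given page minus the number printed on that page.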

Step 4: Download k2pdfopt:

http://willus.com/k2pdfopt/download/

Step 5 (Windows only): Disabling the GUI:

Disable the GUI by following this tutorial: http://willus.com/k2pdfopt/help/nogui.shtml

Then drag the original pdf file into your shortcut.

Step 6: Run the command:

Windows:

For convenience, copy toc.txt and the source PDF file into the folder that contains your shortcut.

Copy-paste the following command in the terminal and press enter.

-mode copy -n -toclist toc.txt srcfile.pdf -o outfile.pdf

Press enter again to start bookmarking.

OSX or Linux:

k2pdfopt -mode copy -n -toclist toc.txt srcfile.pdf -o outfile.pdf
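If you bookmark many files, the same command can be assembled from a script. This is only a sketch: it reuses the flags from the command above unchanged, and the function name `build_k2pdfopt_command` is hypothetical. Running it requires k2pdfopt on your PATH, and the program may still wait for a key press as described above.

```python
import shlex


def build_k2pdfopt_command(toc_path, src_pdf, out_pdf, executable="k2pdfopt"):
    """Build the argument list for the bookmarking command shown above."""
    return [
        executable,
        "-mode", "copy",
        "-n",
        "-toclist", toc_path,
        src_pdf,
        "-o", out_pdf,
    ]


cmd = build_k2pdfopt_command("toc.txt", "srcfile.pdf", "outfile.pdf")
print(shlex.join(cmd))

# To actually run it (not executed here):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids quoting problems when file names contain spaces.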

Automated:

For now, I plan to start by using available software (e.g. k2pdfopt), and later make the functionality Julia-native (when PDFIO.jl adds PDF write capability).

Current algorithm plan:

  • The user will provide the page numbers that contain the table of contents.

  • Those pages are read from the PDF by Julia.

  • Julia will extract these pages (here the user can be asked to crop the borders).

  • Julia will send the extracted pages to https://ocr.space/ for OCR, and then gets the text of the table of contents back (using the available APIs (Python, C++, etc.)).

  • Julia will edit the received text to bring it into the specified format (here the user can be asked to review it). The prepared text file will be saved.

  • An external program is called from Julia (e.g. k2pdfopt from the command line). That program will read the original PDF file and the text file, generate the bookmarks for the PDF, and save it.

  • Also, if the PDF file is searchable, Julia can inspect the fonts used throughout the PDF and, for example, collect the text set in bold fonts (Infix PDF Editor does this). The user could also specify the fonts manually (the expensive Evermap AutoBookmark plugin for Adobe and Nitro PDF do this).
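The OCR step of this plan could look roughly like the sketch below, written in Python for illustration rather than Julia. The endpoint URL, the form field names (`apikey`, `isTable`, `isOverlayRequired`) and the response keys (`ParsedResults`, `ParsedText`, `IsErroredOnProcessing`, `ErrorMessage`) are assumptions based on the public OCR.space API documentation and should be checked against it before use. The sketch only prepares the request fields and parses a response body; it does not send anything.

```python
import json

# Assumed endpoint of the OCR.space API; verify against the official docs.
OCR_ENDPOINT = "https://api.ocr.space/parse/image"


def build_form_fields(api_key):
    """Form fields for a table-aware OCR request (field names are assumptions)."""
    return {
        "apikey": api_key,
        "isTable": "true",            # keep table rows on one line
        "isOverlayRequired": "false",  # plain text only, no word boxes
    }


def parsed_text(response_body):
    """Join the text of all parsed pages from a JSON response body."""
    payload = json.loads(response_body)
    if payload.get("IsErroredOnProcessing"):
        raise RuntimeError(str(payload.get("ErrorMessage")))
    results = payload.get("ParsedResults") or []
    return "\n".join(result.get("ParsedText", "") for result in results)


# A hand-written sample response, used only to exercise the parser.
sample = json.dumps({
    "IsErroredOnProcessing": False,
    "ParsedResults": [{"ParsedText": "Cover 1"}, {"ParsedText": "Chapter 1 5"}],
})
print(parsed_text(sample))
```

The returned text would then go through the review and formatting step before being saved as the TOC text file.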

Other Manual Methods:

Another method, using JPdfBookmarks:

https://sourceforge.net/projects/jpdfbookmarks/

from https://ebooks.stackexchange.com/a/7763/12921

Prepare the tocText file in the following format:

Chapter 1. The Beginning/23
    Para 1.1 Child of The Beginning/25,FitWidth,96
        Para 1.1.1 Child of Child of The Beginning/26,FitHeight,43
Chapter 2. The Continue/30,TopLeft,120,42
    Para 2.1 Child of The Beginning/32,FitPage

You can OCR the TOC and use regex to fix it.
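If you already have a file in the "page title" layout from Step 3, it could be converted to the "title/page" layout above with a few lines of code. This is a sketch under assumptions: `toclist_to_jpdfbookmarks` is a hypothetical helper, and the use of one tab per nesting level is an assumption (the example above is shown with spaces); check the JPdfBookmarks manual for the exact indentation it expects.

```python
import re

# Step 3 layout: optional '+' signs for the level, page number, then title.
TOC_LINE = re.compile(r"^(?P<plus>\+*)(?P<page>\d+)\s+(?P<title>.+)$")


def toclist_to_jpdfbookmarks(lines, indent="\t"):
    """Convert '+6 Title' style lines into indented 'Title/6' style lines."""
    converted = []
    for line in lines:
        match = TOC_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that are not TOC entries
        depth = len(match["plus"])  # one '+' per nesting level
        converted.append(f"{indent * depth}{match['title']}/{match['page']}")
    return converted


toc = ["5 Chapter 1", "+6 Subchapter1", "++7 Sub-Subchapter1", "25 Chapter 2"]
print("\n".join(toclist_to_jpdfbookmarks(toc)))
```

Titles that contain a "/" would clash with the separator and need manual attention.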

Load that TOC file in JPdfBookmarks.

Expand all bookmarks (Ctrl + E), select all of them, then go to Tools > Apply Page Offset.

Enter the offset, i.e. the number of leading pages by which the PDF page numbers differ from the page numbers in the TOC.

You can read its manual (http://jpdfbookmarks.altervista.org/InsertBookmarks.html#1_3_1) or watch a quick video tutorial (https://youtu.be/7DUkvH7_wII?t=30). It also has a command-line mode and runs on Linux and Mac.

Other Methods for step 2:

References:

https://www.willus.com/k2pdfopt/help/k2menu.shtml

https://www.willus.com/k2pdfopt/help/options.shtml

https://ebooks.stackexchange.com/questions/107/how-to-create-clickable-table-of-contents-in-a-pdf

tocpdf's Issues

For 2 column style TOC that is scan/ocr -- suggestion for best tool to extract TOC text?

Hello, can you suggest the best tool to extract the TOC text from a two-column TOC (the PDF is scanned and OCR'd)?

The problem with OCR.space is that it does not read the text column by column (first column, then second column). Instead, it reads left to right across the page, so the text ends up in the wrong place.

For example, the result extracted by OCR.space is as follows (Chapter Six is in column 2 of the TOC, but the tool has read it on line 1):

Contents
Number Chapter Six: Units..............:.......48
Length, mass, capacity
Chapter One: Types Of and time.... ....

The problem with Tabula is that I could not find any template for a two-column TOC. I tried to create my own template as a new user, and it did a very average job (e.g. it did not recognise the end of a sentence and kept the leading ..... before the page number). I could not find any automatic scripts for the Sublime Text editor to handle the typical TOC text editing issues either.

Nuntber,
Chapter One: Types of,
number ........................................... 2,
Squares and square roots .................,2
Cubes and cube roots .......................,2
Multiples .......................................,4
Prime factorisation ..........................,6
Chapter Two: Using numbers .....1 0,

Tabula is better than OCR.space in that the text is in the correct order, but it still takes a lot of manipulation in Sublime Text to get the TOC text file into the layout required to auto-create TOC bookmarks in the PDF (i.e. using one of the apps, pdftk or jpdfbookmarks).

Tabula currently offers no way to ask for help: on GitHub, the Issues tab is not showing.
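One possible way around the reading-order problem described in this issue is to reorder the recognised words by position before joining them into lines. The sketch below assumes that the OCR tool can return a bounding box for each word (for example through a word overlay output) and that the two columns are separated at the middle of the page; both are assumptions, and `reorder_two_columns` is a hypothetical helper rather than a feature of any tool named here.

```python
def reorder_two_columns(words, page_width, row_tolerance=5):
    """Return text lines with the left column first, then the right column.

    `words` is a list of (left, top, text) tuples in page coordinates.
    """
    middle = page_width / 2
    left_column = [w for w in words if w[0] < middle]
    right_column = [w for w in words if w[0] >= middle]

    def to_lines(column):
        rows = []  # each row: (top of first word, [(left, text), ...])
        for left, top, text in sorted(column, key=lambda w: (w[1], w[0])):
            if rows and abs(top - rows[-1][0]) <= row_tolerance:
                rows[-1][1].append((left, text))  # same visual line
            else:
                rows.append((top, [(left, text)]))  # start a new line
        return [" ".join(text for _, text in sorted(items)) for _, items in rows]

    return to_lines(left_column) + to_lines(right_column)


words = [
    (10, 100, "Chapter"), (60, 100, "One"),
    (300, 100, "Chapter"), (350, 101, "Six"),
    (10, 120, "Squares"), (300, 120, "Length"),
]
for line in reorder_two_columns(words, page_width=500):
    print(line)
```

Splitting at the page midpoint is the simplest choice; a TOC with uneven columns would need the split position to be set by hand.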
