obsidian-extract-pdf's Introduction

Extract PDF text to Markdown

Allows you to extract the basic textual content of a PDF into a Markdown file. Works well with headings, paragraphs and lists.

How to use this plugin

After you've installed and activated the plugin:

  1. Drag your PDF into Obsidian
  2. Open the PDF within Obsidian
  3. Make sure the pane with your PDF is focused
  4. Click the "PDF to Markdown" button in the sidebar
  5. Edit the generated markdown file to your needs

Tips & Tricks for editing the generated markdown file

I just went ahead and turned a 500-page PDF into Markdown and found that it worked better and faster than I expected.

Bulk-removing page footers

The book I used had the same footer on every page. That means the footer got copied into the Markdown file over and over, too.

For bulk search-and-replace I use the Atom editor (https://atom.io):

  1. Copy the footer text into your clipboard
  2. Download and install Atom
  3. Open Atom and open the Markdown file inside
  4. Use "Find -> Find in Buffer" and paste the footer text
  5. Use the button "Replace" or "Replace All" to remove footer text
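
If you would rather script the cleanup than use an editor, the same idea can be sketched in a few lines of Python. This is only an illustration and not part of the plugin: the function name `remove_footer` is made up, and the sketch assumes the footer sits on a line of its own in the generated Markdown.

```python
def remove_footer(markdown: str, footer: str) -> str:
    """Drop every line that consists solely of the footer text.

    Assumption: the footer occupies its own line. If it is embedded in a
    longer line, markdown.replace(footer, "") is the closer equivalent of
    the editor's "Replace All".
    """
    wanted = footer.strip()
    # keepends=True preserves the original line endings of the kept lines
    kept = [
        line
        for line in markdown.splitlines(keepends=True)
        if line.strip() != wanted
    ]
    return "".join(kept)
```

To use it, read the generated Markdown file into a string, pass it through `remove_footer` together with the footer text you copied, and write the result back to the file.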

Remove a single space before a new line of text

Weirdly, new lines of text sometimes had a space in front of them, such as:

 Some text

...which resulted in Obsidian treating the line as a sub-block of the preceding line.

To remove the space for those lines, I used a regular expression search-and-replace:

  1. In "Find in current buffer" activate "Regex Search" (The .* icon)
  2. Enter ^([ ]|\t)+ into the search field
  3. Use the button "Replace" or "Replace All" to remove the space
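
The same search-and-replace can be sketched in Python with the standard `re` module. This is an illustration under assumptions, not part of the plugin: the function name `strip_leading_whitespace` is made up, and the pattern `^[ \t]+` is a slightly simplified equivalent of the expression above. The `re.MULTILINE` flag is needed so that `^` matches at the start of every line rather than only at the start of the text.

```python
import re

# Equivalent to ^([ ]|\t)+ from the steps above: one or more spaces or tabs
# at the beginning of a line. MULTILINE makes ^ match after every newline.
LEADING_WHITESPACE = re.compile(r"^[ \t]+", flags=re.MULTILINE)


def strip_leading_whitespace(markdown: str) -> str:
    """Remove spaces and tabs from the start of every line."""
    return LEADING_WHITESPACE.sub("", markdown)
```

Be aware that this removes all leading indentation, so nested list items and indented code blocks are flattened as well. That is fine for plain extracted text but may not be what you want in a file you have already structured by hand.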

Known issues

First-time use

If you had a PDF open in Obsidian before you installed and activated the plugin, hitting the button may not work. I've had this issue with other plugins as well. The code just doesn't hook up to already-open files.

The solution is to simply close the PDF note and re-open it. That will allow the plugin to hook into it.

Limited PDF parsing

Please understand that this is a basic, best-effort tool to get basic text and headings from a PDF. It really just gets the text from a PDF and turns it into Markdown. The plugin doesn't handle anything more complex, like tables, images or annotations:

  • Does not turn PDF highlights and annotations into MD highlights
  • Does not retain PDF numbered lists
  • Does not skip text in headers and footers

obsidian-extract-pdf's People

Contributors

akaalias, lonerifle

obsidian-extract-pdf's Issues

Extract math formulas from the PDF and include them in the Markdown note

Many papers have (LaTeX) math formulas and expressions, either inline or as separate equations. These are not extracted properly.

For instance, the following bit of a paper I'm reading:

image

Is extracted as:

 In the multilevel literature it has been well recognized that using the model inEquation 2as the basis of a multilevel regression analysis will lead to a slope estimate that represents neither the within-cluster slope�( w ), nor the between-cluster slope�( b ).Ifwe refer to this estimated slope as�ˆ, it has been shown that it is a weighted sum of the estimated slopes at both levels, that is 

 ��� ˆ ˆ �( b )�(1��)� ˆ ( w ) �� ˆ ( w )��(� ˆ ( b )� ˆ �( w )) 

##### (3) 

 where�can be thought of as a measure indicating the relative amount of variability at the between-cluster level compared to the total variability in the data across all clusters and time points (Mundlak, 1978;Neuhaus & Kalbfleisch, 1998;Raudenbush & Bryk, 2002). Hence, the weight�is not the intraclass correlation, which is independent of the number of time points; instead it is a direct function of the number of time points and becomes smaller as the number of time points increases, as we will see later. The weighted sum�is represented as the dashed line inFigure 2, which lies somewhere between the within-person slope�( w )and the between-person slope�( b ).

And rendered as:

image

It would be great if solved. Thanks in advance!

Garbled problem

When the PDF is in Chinese, the extracted files are all garbled.

Plugin button doesn't appear

This plugin sounds great, thanks for creating it. I just installed it on Linux and there is no button appearing in the sidebar. The plugin is turned on. I tried restarting. Any suggestions? Thanks.

request: add to command palette

Pretty self-explanatory, I want to be able to call the PDF to Markdown command from the command palette.

This is because I don't use the sidebar ribbon and therefore have no access to the plugin without turning the ribbon on again, not even via a hotkey.

Thanks for the plugin!

Breaking "Better PDF Plugin" - Pdf not showing

Embedded PDFs do not show up. In Better PDF Plugin you can embed PDFs in Obsidian pages using text boxes and a pretty simple syntax (as you can see on the main GitHub page of the project).

I use it a lot, and after disabling and enabling all the plugins I have installed, it turned out that, for some strange reason, obsidian-extract-pdf is the one that does not work well alongside it.

The PDFs just do not show up. In edit mode I can correctly see the text, while in preview mode it just vanishes, as if the text box didn't exist.

Example:

############
##Edit Mode##
############
First text

        "url" : "Obsidian/helloworld.pdf"

Second text

###############
##Preview Mode##
###############
First text
Second text

Breaking PDFs opening on mobile

Hi,
We have received several reports that when this plugin is installed, PDFs do not open on mobile.
Perhaps it is because you bundle your own pdf.js. You could use the one included in Obsidian.
However, I am not 100% sure this is the issue.

PDF to MD icon not working

Hi akaalias!

I really appreciate your plugins!

There is an issue when I'm trying to extract Markdown with the PDF to MD plugin.

It doesn't work: it doesn't show any new page with the Markdown.

I tried:

  • Reloading
  • Restarting the app
  • Uninstalling and Reinstalling
  • The same problem with the PDF Highlights to MD

Please help!

Icon too light

Hello!
This plugin is great, but I cannot see the icon in light mode:
image
(vs. dark mode:
image)

Doesn't work

Running the latest Obsidian and the updated add-on. First time using it. The extract PDF button says "no highlights found". Tested on multiple PDFs within Obsidian, marked up on both Mac and Windows with Preview or PDF-XChange Editor. Nothing works, but the highlights are visible.

Plugin prevents PDF preview in Obsidian

Hi,

I recently noticed the PDFs are not being previewed in Obsidian. By trial and error, I disabled/enabled plugins one by one, and apparently, this plugin is causing the problem.

I'm on Obsidian v0.10.7 and using plugin version 0.0.6.

Plugin opens a dialogue box then does nothing obvious

Steps to reproduce:

  1. Drag your PDF into Obsidian
  2. Open the PDF within Obsidian
  3. Make sure the pane with your PDF is focused
  4. Click the "PDF to Markdown" button in the sidebar

I have let it run for 30 min with no result.
I have dragged in a new PDF, with the same issue.
I have closed and reopened the vault -- no change.
I have rebooted -- no change.

I am on

  • PopOS 20.10
  • Obsidian 0.12.4
