
pdfanno's Introduction

PDFAnno

PDFAnno is a browser-based linguistic annotation tool for PDF documents.
It lets you annotate PDFs with labels and relations.
It is suited to building gold-standard data for natural language processing and machine learning, including named entity spans, dependency relations, and coreference chains.

If you use PDFAnno, please cite the following paper:

Hiroyuki Shindo, Yohei Munesada and Yuji Matsumoto,
"PDFAnno: a Web-based Linguistic Annotation Tool for PDF Documents",
In Proceedings of LREC, 2018.

We strongly recommend using the latest version of Chrome. (Firefox support is planned.)

Installation

To install PDFAnno locally:

git clone https://github.com/paperai/pdfanno.git
cd pdfanno
npm install
cp .env.example .env

Then, edit .env as you like.
The default values are:

SERVER_PORT=1000

Run Server

npm run server

Usage

  1. Visit the online demo with the latest version of Chrome.
  2. Load your PDF and annotation file (if any). Sample PDFs and annotations are downloadable from here.
  3. Annotate the PDF as you like.
  4. Save your annotations with the save button.
    To continue annotating later, re-select your directory via the Browse button to reload the PDF and anno file.

For security reasons, PDFAnno does NOT automatically save your annotations.
Don't forget to download your current annotations!

Annotation Tools

Span highlighting: Marks a span of text. A span cannot cross a page boundary.
One-way relation: Marks a directed relation between two spans, such as a dependency relation.
Rectangle: Marks a rectangular region. A rectangle cannot cross a page boundary.

Annotation File (.anno)

In PDFAnno, an annotation file (.anno) uses the TOML format.
Here is an example anno file:

pdfanno = "0.4.1"
pdfextract = "0.2.4"

[[spans]]
id = "1"
page = 1
label = "label1"
text = "AgBi 0.05 Sb 0.95 Te 2"
textrange = [1422,1438]

[[spans]]
id = "2"
page = 1
label = "label1"
text = "0.48 Wm [NO_UNICODE] 1 K [NO_UNICODE] 1 )"
textrange = [1386,1397]

[[relations]]
head = "1"
tail = "2"
label = "relation1"

Here, textrange gives the start and end token IDs of the span in pdftxt.
pdftxt is a text file extracted from the original PDF file.
You can download pdftxt via the pdf.txt button at the top right of the screen.
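
Because an anno file is plain TOML, it can be read with any TOML parser. The following is a minimal sketch, not part of PDFAnno, that loads the example above with Python's tomllib and replaces each relation's head and tail IDs with the text of the spans they point to. The helper names load_anno and resolve_relations are hypothetical.

```python
try:
    import tomllib  # standard library in Python 3.11+
except ImportError:
    import tomli as tomllib  # assumption: the tomli backport is installed on older Pythons

# The example anno file from this README, embedded so the sketch is self-contained.
ANNO_TEXT = '''
pdfanno = "0.4.1"
pdfextract = "0.2.4"

[[spans]]
id = "1"
page = 1
label = "label1"
text = "AgBi 0.05 Sb 0.95 Te 2"
textrange = [1422,1438]

[[spans]]
id = "2"
page = 1
label = "label1"
text = "0.48 Wm [NO_UNICODE] 1 K [NO_UNICODE] 1 )"
textrange = [1386,1397]

[[relations]]
head = "1"
tail = "2"
label = "relation1"
'''


def load_anno(text):
    """Parse anno text and return (spans indexed by id, list of relations)."""
    data = tomllib.loads(text)
    spans = {span["id"]: span for span in data.get("spans", [])}
    relations = data.get("relations", [])
    return spans, relations


def resolve_relations(spans, relations):
    """Turn each relation into a (head text, label, tail text) triple."""
    return [
        (spans[rel["head"]]["text"], rel["label"], spans[rel["tail"]]["text"])
        for rel in relations
    ]


spans, relations = load_anno(ANNO_TEXT)
triples = resolve_relations(spans, relations)
for head, label, tail in triples:
    print(f"{head} --{label}--> {tail}")
```

To read a file from disk instead, open it in binary mode and pass the file object to tomllib.load.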

Reference Anno File

To support multi-user annotation, PDFAnno can load reference anno files.
For example, if you create a.anno and another annotator creates b.anno for the same PDF, load a.anno as usual and load b.anno as a reference file. PDFAnno then renders a.anno and b.anno in different colors. Rendering more than one reference file is also supported.
This is useful for checking inter-annotator agreement and resolving annotation conflicts.
Note that reference files are rendered as read-only.
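
Agreement between two anno files can also be checked outside the browser. The sketch below is one simple way to do it, not something PDFAnno computes itself: it treats two spans as the same annotation only when page, textrange, and label all match exactly, then reports the overlap as a Jaccard ratio. The function names and the second annotator's label2 are hypothetical.

```python
try:
    import tomllib  # standard library in Python 3.11+
except ImportError:
    import tomli as tomllib  # assumption: the tomli backport is installed on older Pythons

# Annotator A: the two spans from the example anno file.
A_ANNO = '''
[[spans]]
id = "1"
page = 1
label = "label1"
text = "AgBi 0.05 Sb 0.95 Te 2"
textrange = [1422,1438]

[[spans]]
id = "2"
page = 1
label = "label1"
text = "0.48 Wm [NO_UNICODE] 1 K [NO_UNICODE] 1 )"
textrange = [1386,1397]
'''

# Annotator B (hypothetical): same spans, but a different label on the second one.
B_ANNO = '''
[[spans]]
id = "1"
page = 1
label = "label1"
text = "AgBi 0.05 Sb 0.95 Te 2"
textrange = [1422,1438]

[[spans]]
id = "2"
page = 1
label = "label2"
text = "0.48 Wm [NO_UNICODE] 1 K [NO_UNICODE] 1 )"
textrange = [1386,1397]
'''


def span_keys(anno_text):
    """Reduce each span to a hashable (page, textrange, label) key."""
    data = tomllib.loads(anno_text)
    return {
        (span["page"], tuple(span["textrange"]), span["label"])
        for span in data.get("spans", [])
    }


def compare_spans(a_text, b_text):
    """Return shared and unshared span keys plus an exact-match Jaccard ratio."""
    a, b = span_keys(a_text), span_keys(b_text)
    union = a | b
    return {
        "both": a & b,
        "only_a": a - b,
        "only_b": b - a,
        "agreement": len(a & b) / len(union) if union else 1.0,
    }


result = compare_spans(A_ANNO, B_ANNO)
print(f"agreement: {result['agreement']:.2f}")
print("only in A:", sorted(result["only_a"]))
print("only in B:", sorted(result["only_b"]))
```

Exact matching is strict: a span that differs by one token counts as a disagreement. A chance-corrected measure or partial-overlap matching may suit some projects better.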

Contact

Please contact hshindo or feel free to create an issue.

LICENSE

MIT

pdfanno's People

Contributors

hshindo, kmamiya, navinisoft, takahirohorie, yoheimune


pdfanno's Issues

pdfextract-0.2.4.jar is missing

package.json specifies that pdfextract-0.2.4.jar should be downloaded from https://github.com/paperai/pdfextract/releases/download/untagged-6b0e4f23df695e8b7587/pdfextract-0.2.4.jar, but that link returns a 404. If you don't already have pdfextract-0.2.4.jar, GitHub's 404 error HTML page gets saved as pdfextract-0.2.4.jar and processing fails.
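
One way to catch this failure early is to check that the downloaded file is really an archive. A JAR is a ZIP archive, so Python's zipfile.is_zipfile can tell it apart from a saved HTML error page. This is a minimal sketch with in-memory stand-ins for both files; the helper name looks_like_jar is hypothetical and nothing here is part of the PDFAnno build.

```python
import io
import zipfile


def looks_like_jar(source):
    """Return True if source (a path or binary file object) is a ZIP archive."""
    return zipfile.is_zipfile(source)


# Stand-in for a real JAR: a tiny ZIP archive containing only a manifest.
real_jar = io.BytesIO()
with zipfile.ZipFile(real_jar, "w") as archive:
    archive.writestr("META-INF/MANIFEST.MF", "Manifest-Version: 1.0\n")

# Stand-in for the broken download: an HTML error page saved under a .jar name.
fake_jar = io.BytesIO(b"<!DOCTYPE html><html><body>Not Found</body></html>")

print(looks_like_jar(real_jar))
print(looks_like_jar(fake_jar))
```

On a real checkout, pass the path of the downloaded file instead of an in-memory buffer.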

Not able to annotate a PDF

Hi,

I have built and installed the app locally. I am trying to annotate a pdf with pdfAnno running at my localhost. When I select or double click on a span, I don't see any option getting populated to annotate it.


Please advise.

Thanks,
Abhinav

Deploy locally

Hey!

I am interested in your project and tool PDFAnno.
I tried to install PDFAnno locally as mentioned in README.md.
All steps are OK except last:
npm run server

I looked in package.json, and there is no server script.

Can you help me install PDFAnno locally?

install fail

npm install
npm WARN optional Skipping failed optional dependency /chokidar/fsevents:
npm WARN notsup Not compatible with your operating system or architecture: [email protected]
npm WARN optional Skipping failed optional dependency /vuepress/chokidar/fsevents:
npm WARN notsup Not compatible with your operating system or architecture: [email protected]
npm WARN optional Skipping failed optional dependency /watchpack/chokidar/fsevents:
npm WARN notsup Not compatible with your operating system or architecture: [email protected]
npm WARN optional Skipping failed optional dependency /webpack-dev-server/chokidar/fsevents:
npm WARN notsup Not compatible with your operating system or architecture: [email protected]
npm WARN [email protected] requires a peer of webpack@1 || ^2.1.0-beta but none was installed.

npm run server
npm ERR! Linux 4.15.0-24-generic
npm ERR! argv "/usr/bin/node" "/usr/bin/npm" "run" "server"
npm ERR! node v8.10.0
npm ERR! npm v3.5.2

npm ERR! missing script: server
npm ERR!
npm ERR! If you need help, you may report this error at:
npm ERR! https://github.com/npm/npm/issues

npm ERR! Please include the following file with any support request:
npm ERR! /home/caseybasichis/Prog/text/pdfanno/npm-debug.log

getting error internal/modules/cjs/loader.js:584 throw err;

I am getting the error below while running the application. Could you please help me resolve the issue?

E:\R&D\POC\pdfanno>npm start

[email protected] start E:\R&D\POC\pdfanno
node index.js

internal/modules/cjs/loader.js:584
throw err;
^

Error: Cannot find module 'E:\R&D\POC\pdfanno\index.js'
at Function.Module._resolveFilename (internal/modules/cjs/loader.js:582:15)
at Function.Module._load (internal/modules/cjs/loader.js:508:25)
at Function.Module.runMain (internal/modules/cjs/loader.js:754:12)
at startup (internal/bootstrap/node.js:283:19)
at bootstrapNodeJSCore (internal/bootstrap/node.js:622:3)
npm ERR! code ELIFECYCLE
npm ERR! errno 1
npm ERR! [email protected] start: node index.js
npm ERR! Exit status 1
npm ERR!
npm ERR! Failed at the [email protected] start script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.

npm ERR! A complete log of this run can be found in:
npm ERR! C:\Users\shravan\AppData\Roaming\npm-cache_logs\2019-04-11T02_07_53_442Z-debug.log

Demo is not working

After starting the demo, I get the following error:
Error
Failed to analyze the PDF.
Reason: [object ProgressEvent]

Package pdfextract-0.3.0.jar with version 5.1 to get PDF character coordinates

Hello @hshindo! It looks like @navinisoft created a version 5.1 of this repo that works by using the output of java -classpath pdfextract.jar PDFExtractor example.pdf along with example.pdf to render the PDF and do the highlighting. I was wondering if either of you have that JAR available somewhere, and if you could share it somewhere or add it to this repo. I cannot find the repo in which this JAR is supposed to be anymore: https://web.archive.org/web/20180627073429/https://github.com/paperai/pdfextract/

It would also really help if the README.md explained that each PDF is expected to be uploaded with a corresponding pdf.txt.gz file and that otherwise highlighting won't work. I had to play around with the code for a while and go through a few versions to figure that out.

Thank you for your help! I like the work you guys did on pdfanno and I'm hoping to use it. Cheers!

pdf-sample.pdf.0-3-0.txt.gz what is this for?

Hi,

I am trying to set up your project in a local environment, but I run into a problem when I load my own .pdf file. It seems every PDF file must have a .txt.gz version. What is this for? Can I disable it somehow?

Failed to analyze the PDF.
Reason: Error: HTTP 404 - pdftxtファイルのロードに失敗しました。 (Failed to load the pdftxt file.)

Thanks

Highlighting and data text retrieval not working

I just tested your demo on my own PDF. The highlight seems to snap to fixed lines, which fell in between the lines of my text. In addition, the text that is highlighted is not the text that shows up in the annotation download, which makes the tool unusable for me. I can send along my PDF and annotation file to demonstrate.

Colored spans

This is more a suggestion than an issue: would it be possible to add colored spans? They are currently all yellow (tell me if I'm wrong). It would be much easier for the annotator to distinguish spans.

Thanks again for your great work.

Presentation mode arrows don't work

Hi, I am running into an issue where the left/right arrows don't work. Did you disable them on purpose, or where could the problem be? I thought this was supported in the pdf.js library by default.

Thanks

textrange of spans are incorrect

It seems that the textrange written to the annotation file is incorrect when you compare it to the textrange in the pdftxt file.

For example, if I have a span "Syntatic parsing", its textrange in the annotation file is [1313,1328], but in the pdftxt file it is [1533-1549] (which seems correct).

Thanks for your work !

If you can't deploy app.

I've changed gulpfile.js and updated gulp-cli to version 4.0.2 so everything works properly.
https://github.com/Forfxout/pdfanno

Installation:
$ npm install -g gulp-cli # with node >= 10.
$ npm install
$ npm audit fix # if required, but please note that you don't need to force
$ npm run front:dev
