paperai / pdfanno

Linguistic Annotation and Visualization Tool for PDF Documents

JavaScript 90.00% CSS 6.25% HTML 3.75%

pdf annotation nlp

pdfanno's Introduction

PDFAnno

PDFAnno is a browser-based linguistic annotation tool for PDF documents.
It offers functions for annotating PDFs with labels and relations.
It is suitable for developing gold-standard data for natural language processing and machine learning, such as named entity spans, dependency relations, and coreference chains.

If you use PDFAnno, please cite the following paper:

Hiroyuki Shindo, Yohei Munesada and Yuji Matsumoto,
"PDFAnno: a Web-based Linguistic Annotation Tool for PDF Documents",
In Proceedings of LREC, 2018.

Online Demo (v0.4.1)

We highly recommend using the latest version of Chrome. (Firefox will also be supported in the future.)

Installation

To install PDFAnno locally:

git clone https://github.com/paperai/pdfanno.git
cd pdfanno
npm install
cp .env.example .env

Then, edit .env as you like.
The default values are:

SERVER_PORT=1000

Run Server

npm run server

Usage

Visit the online demo with the latest version of Chrome.
Load your PDF and annotation file (if any). Sample PDFs and annotations are downloadable from here.
- For PDFs located on your computer:
  Put the PDFs and annotation files (if any) in the same directory, then specify the directory via the Browse button.
- For a PDF available on the Web:
  Access 'https://paperai.github.io/pdfanno/latest/?pdf=' + <URL of the PDF>
  For example, https://paperai.github.io/pdfanno/latest/?pdf=http://www.aclweb.org/anthology/P12-1046.pdf.
Annotate the PDF as you like.
Save your annotations via the save button.
To continue annotating later, respecify your directory via the Browse button to reload the PDF and anno file.
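
The URL pattern for a PDF on the Web can be sketched in a few lines of Python. The helper name `demo_url` is hypothetical, and escaping characters such as spaces is an assumption: the usage notes only show the PDF URL appended unescaped.

```python
from urllib.parse import quote

# Base URL of the hosted demo, taken from the usage notes above.
DEMO_BASE = "https://paperai.github.io/pdfanno/latest/"


def demo_url(pdf_url: str) -> str:
    """Return the demo URL that opens the PDF located at pdf_url."""
    # The documented example passes the PDF URL as-is, so ':' and '/'
    # are left untouched. Escaping other characters (e.g. spaces) is an
    # assumption, not documented PDFAnno behavior.
    return DEMO_BASE + "?pdf=" + quote(pdf_url, safe=":/")


print(demo_url("http://www.aclweb.org/anthology/P12-1046.pdf"))
```

For the ACL Anthology PDF used above, this yields the same URL as the documented example.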

For security reasons, PDFAnno does NOT automatically save your annotations.
Don't forget to download your current annotations!

Annotation Tools

Icon	Description
	Span highlighting. A span cannot cross a page boundary.
	One-way relation. Used to annotate a dependency relation between two spans.
	Rectangle. A rectangle cannot cross a page boundary.

Annotation File (.anno)

In PDFAnno, an annotation file (.anno) follows the TOML format.
Here is an example anno file:

pdfanno = "0.4.1"
pdfextract = "0.2.4"

[[spans]]
id = "1"
page = 1
label = "label1"
text = "AgBi 0.05 Sb 0.95 Te 2"
textrange = [1422,1438]

[[spans]]
id = "2"
page = 1
label = "label1"
text = "0.48 Wm [NO_UNICODE] 1 K [NO_UNICODE] 1 )"
textrange = [1386,1397]

[[relations]]
head = "1"
tail = "2"
label = "relation1"

Here, textrange gives the start and end token IDs in pdftxt.
pdftxt is a text file extracted from the original PDF file.
You can download pdftxt via the pdf.txt button at the top right of the screen.

Reference Anno File

To support multi-user annotation, PDFAnno can load reference anno files.
For example, if you create a.anno and another annotator creates b.anno for the same PDF, load a.anno as usual, and load b.anno as a reference file. PDFAnno then renders a.anno and b.anno in different colors. Rendering more than one reference file is also supported.
This is useful for checking inter-annotator agreement and resolving annotation conflicts.
Note that the reference files are rendered as read-only.
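
Agreement can also be checked outside the browser once both anno files are parsed. The sketch below is not a PDFAnno feature: it assumes the spans of a.anno and b.anno are already available as lists of dictionaries, uses hypothetical sample data, and counts two spans as matching only when page, textrange, and label are all identical. The resulting score is a simple overlap ratio, not a chance-corrected measure.

```python
def span_key(span):
    """Identify a span by its page, token range, and label."""
    return (span["page"], tuple(span["textrange"]), span["label"])


def compare_spans(spans_a, spans_b):
    """Return (shared, only in a, only in b) as sets of span keys."""
    keys_a = {span_key(s) for s in spans_a}
    keys_b = {span_key(s) for s in spans_b}
    return keys_a & keys_b, keys_a - keys_b, keys_b - keys_a


# Hypothetical spans, as they might look after parsing a.anno and b.anno.
a = [
    {"page": 1, "textrange": [1422, 1438], "label": "label1"},
    {"page": 1, "textrange": [1386, 1397], "label": "label1"},
]
b = [
    {"page": 1, "textrange": [1422, 1438], "label": "label1"},
    {"page": 1, "textrange": [1386, 1397], "label": "label2"},
]

both, only_a, only_b = compare_spans(a, b)
# Exactly matching spans divided by all distinct spans from either annotator.
agreement = len(both) / len(both | only_a | only_b)
print(f"matching: {len(both)}, only a: {len(only_a)}, only b: {len(only_b)}")
print(f"exact-match agreement: {agreement:.2f}")
```

The spans listed under "only a" and "only b" are the conflicts to resolve; in this sample the two annotators chose the same token range but different labels.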

Contact

Please contact hshindo or feel free to create an issue.

LICENSE

MIT

pdfanno's Issues

pdfextract-0.2.4.jar is missing

package.json specifies that pdfextract-0.2.4.jar should be downloaded from https://github.com/paperai/pdfextract/releases/download/untagged-6b0e4f23df695e8b7587/pdfextract-0.2.4.jar; however, that link returns a 404. If you don't already have pdfextract-0.2.4.jar, GitHub's 404 error HTML page gets saved as pdfextract-0.2.4.jar and processing fails.

Not able to annotate a PDF

Hi,

I have built and installed the app locally. I am trying to annotate a pdf with pdfAnno running at my localhost. When I select or double click on a span, I don't see any option getting populated to annotate it.

Please advise.

Thanks,
Abhinav

Deploy locally

Hey!

I am interested in your project and tool PDFAnno.
I tried to install PDFAnno locally as mentioned in README.md.
All steps are OK except the last:
npm run server

I looked in package.json and there is no server script.

Can you help me install PDFAnno locally?

Update video demonstration

Can you update the video demonstration? The new version confuses me.

install fail

npm install
npm WARN optional Skipping failed optional dependency /chokidar/fsevents:
npm WARN notsup Not compatible with your operating system or architecture: [email protected]
npm WARN optional Skipping failed optional dependency /vuepress/chokidar/fsevents:
npm WARN notsup Not compatible with your operating system or architecture: [email protected]
npm WARN optional Skipping failed optional dependency /watchpack/chokidar/fsevents:
npm WARN notsup Not compatible with your operating system or architecture: [email protected]
npm WARN optional Skipping failed optional dependency /webpack-dev-server/chokidar/fsevents:
npm WARN notsup Not compatible with your operating system or architecture: [email protected]
npm WARN [email protected] requires a peer of webpack@1 || ^2.1.0-beta but none was installed.

npm run server
npm ERR! Linux 4.15.0-24-generic
npm ERR! argv "/usr/bin/node" "/usr/bin/npm" "run" "server"
npm ERR! node v8.10.0
npm ERR! npm v3.5.2

npm ERR! missing script: server
npm ERR!
npm ERR! If you need help, you may report this error at:
npm ERR! https://github.com/npm/npm/issues

npm ERR! Please include the following file with any support request:
npm ERR! /home/caseybasichis/Prog/text/pdfanno/npm-debug.log

getting error internal/modules/cjs/loader.js:584 throw err;

I am getting the error below while running the application. Could you please help me resolve the issue?

E:\R&D\POC\pdfanno>npm start

[email protected] start E:\R&D\POC\pdfanno
node index.js

internal/modules/cjs/loader.js:584
throw err;
^

Error: Cannot find module 'E:\R&D\POC\pdfanno\index.js'
at Function.Module._resolveFilename (internal/modules/cjs/loader.js:582:15)
at Function.Module._load (internal/modules/cjs/loader.js:508:25)
at Function.Module.runMain (internal/modules/cjs/loader.js:754:12)
at startup (internal/bootstrap/node.js:283:19)
at bootstrapNodeJSCore (internal/bootstrap/node.js:622:3)
npm ERR! code ELIFECYCLE
npm ERR! errno 1
npm ERR! [email protected] start: node index.js
npm ERR! Exit status 1
npm ERR!
npm ERR! Failed at the [email protected] start script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.

npm ERR! A complete log of this run can be found in:
npm ERR! C:\Users\shravan\AppData\Roaming\npm-cache_logs\2019-04-11T02_07_53_442Z-debug.log

Relations do not render in Safari

Using Safari (Version 12.0.2 (13606.3.4.1.4)), the relation arrows are not displayed.

Tested on https://paperai.github.io/pdfanno/latest/

Seems to work with Chrome though.

change default port

I need to run it on port 3020. Where do I edit that?

Demo is not working

After starting the demo, I get the following error:
Error
Failed to analyze the PDF.
Reason: [object ProgressEvent]

Package pdfextract-0.3.0.jar with version 5.1 to get PDF character coordinates

Hello @hshindo! It looks like @navinisoft created a version 5.1 of this repo that works by using the output of java -classpath pdfextract.jar PDFExtractor example.pdf along with example.pdf to render the PDF and do the highlighting. I was wondering if either of you have that JAR available somewhere, and if you could share it somewhere or add it to this repo. I cannot find the repo in which this JAR is supposed to be anymore: https://web.archive.org/web/20180627073429/https://github.com/paperai/pdfextract/

It would also really help if the README.md explained that each PDF is expected to be uploaded with a corresponding pdf.txt.gz file and that otherwise highlighting won't work. I had to play around with the code for a while and go through a few versions to figure that out.

Thank you for your help! I like the work you guys did on pdfanno and I'm hoping to use it. Cheers!

pdf-sample.pdf.0-3-0.txt.gz what is this for?

Hi,

I am trying to set up your project in a local environment, but I run into a problem when I try to load my own .pdf file. It seems every PDF file must have a .txt.gz version. What is this for? Can I disable it somehow?

Failed to analyze the PDF.
Reason: Error: HTTP 404 - Failed to load the pdftxt file.

Thanks

Highlighting and data text retrieval not working

I just tested your demo on my own PDF. The highlight seems to follow some static lines, which fall in between my lines of text. In addition, the text that is highlighted is not the text that shows up in the annotation download, which essentially makes the product unusable. I can send along my PDF and annotation file to demonstrate.

Thanks for your work !

If you can't deploy app.

I've changed gulpfile.js and updated gulp-cli to version 4.0.2 so everything works properly.
https://github.com/Forfxout/pdfanno

Installation:
$ npm install -g gulp-cli # with node >= 10.
$ npm install
$ npm audit fix # if required, but please note that you don't need to force
$ npm run front:dev

pdfextract-0.3.0.jar is missing

The link mentioned in package.json for getting pdfextract-0.3.0.jar, https://github.com/paperai/pdfextract/releases/download/v0.3.0/pdfextract-0.3.0.jar, returns a 404 page not found error. Where can I get pdfextract-0.3.0.jar? Can someone share the latest path?