openpaperwork / paperwork

Personal document manager (Linux/Windows) -- Moved to Gnome's Gitlab

Home Page: https://gitlab.gnome.org/World/OpenPaperwork/paperwork

Shell 0.56% Python 98.20% CSS 0.19% Makefile 1.05%

document-management personal-document-system dms edms python python3 ocr indexing gtk gtk3 sane pdf scanner paperwork

paperwork's Introduction

Paperwork - openpaper.work

Moved to Gnome's Gitlab.

paperwork's Issues

Import PDF

Someone suggested that being able to import PDFs could be useful.

The idea is the following: instead of focusing only on scanned papers, Paperwork should be able to deal with any kind of bill, letter, or similar document. Everybody gets some of these by email, in PDF format. This way, instead of having to do two searches (one with something like Beagle, and one with Paperwork), the user could simply do a single search in Paperwork.

--> Is it really useful?
--> Maybe it could be done as a plugin? (That implies a plugin mechanism must be implemented.)

Add tag support

Being able to tag documents would be a nice feature, for instance with tags like "bank", "salary", "bike crash", etc. :)

Having colors on these tags would also be awesome.

Image import function

Add an image import function. --> one image == one page ; one directory == one document.
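
A minimal sketch of the mapping described above (one image is one page, one directory is one document). The function name and the list of accepted extensions are assumptions made for illustration; they are not part of Paperwork.

```python
import os

# Hypothetical set of file extensions treated as importable images.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def plan_image_import(directory):
    """Return the image files of `directory` as an ordered list of pages.

    The directory stands for one document; each image in it is one page.
    Pages are ordered by file name.
    """
    return [
        os.path.join(directory, name)
        for name in sorted(os.listdir(directory))
        if os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS
    ]
```

Sorting by file name is only one possible page order; sorting by modification time would be another reasonable choice.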

Use the hOCR output of Tesseract

Since v3, Tesseract has been able to generate hOCR files. Since v3.01, it has shipped with a configuration file for that (Tesseract Issue 377).
Using this format would avoid having to assemble the .txt and .box files each time the user wants to see a page.
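
To illustrate why hOCR is convenient, here is a rough sketch that reads words and their bounding boxes from an hOCR file in a single pass, using only the Python standard library. It assumes that words are stored in `span` elements with the class `ocrx_word` and a `title` attribute containing `bbox x1 y1 x2 y2`; the exact class name may differ between Tesseract versions, so treat it as an assumption.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (word, (x1, y1, x2, y2)) pairs from hOCR word elements."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bounding box of the word element being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        # Assumed markup: <span class='ocrx_word' title='bbox ...'>
        if tag == "span" and "ocrx_word" in classes and match:
            self._bbox = tuple(int(value) for value in match.groups())
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            word = "".join(self._text).strip()
            if word:
                self.words.append((word, self._bbox))
            self._bbox = None
```

Usage: create a `HocrWordParser`, call `feed()` with the hOCR text, then read the `words` attribute.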

Highlight search keywords in the document

Keywords are now highlighted in the document. It would be even better if search keywords were highlighted as well.

Merge the main window and the search window

The search bar should be in the main window. Results should take the place of the pages list.

Empty directory created when the first scan is interrupted

When the user interrupts the scan of the first page of a document (by pressing a button on the scanner itself or Ctrl-C in the terminal), Paperwork leaves an empty directory.
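
One possible way to avoid the leftover directory, sketched as a context manager: the directory is removed again if nothing was written into it, whatever the reason the scan stopped. The name `new_document_dir` is hypothetical and this is not how Paperwork is actually structured.

```python
import os
from contextlib import contextmanager


@contextmanager
def new_document_dir(path):
    """Create a document directory and remove it again if it stays empty."""
    os.makedirs(path)
    try:
        yield path
    finally:
        # Runs on normal exit, on errors, and on Ctrl-C (KeyboardInterrupt).
        if not os.listdir(path):
            os.rmdir(path)
```

The scan code would run inside `with new_document_dir(path):`, so an interrupted first page leaves nothing behind, while a directory that already contains a page is kept.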

Support for scanner feeder

Currently, only page-by-page scanning is supported. Support for scanner feeders should be added as well (see simple-scan, for instance).

Add an option to delete / edit labels

Labels are currently only deleted when no document uses them anymore.
It would be nicer if a right-click menu allowed removing a label from all documents or editing it (for example, changing its name or its color).

Suggestions: only suggest searches that will actually return results

Currently Paperwork suggestions are based on each keyword individually. But when we are searching with many keywords, suggestions made due to one keyword can be incompatible with the other one(s).

--> It would be nice if it only suggested useful corrections.
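
A rough sketch of the idea, using a toy in-memory index (a dict mapping each word to the set of document IDs containing it). The index structure and both function names are assumptions for illustration only. Each candidate correction is tried together with the other keywords and kept only if the whole query still matches at least one document.

```python
def matching_docs(keywords, index):
    """Return the IDs of the documents containing every keyword."""
    doc_sets = [index.get(word, set()) for word in keywords]
    return set.intersection(*doc_sets) if doc_sets else set()


def useful_suggestions(keywords, corrections, index):
    """Keep only the suggested queries that match at least one document.

    `keywords` is a list of words; `corrections` maps a keyword to its
    candidate replacements.
    """
    suggestions = []
    for position, word in enumerate(keywords):
        for candidate in corrections.get(word, []):
            query = keywords[:position] + [candidate] + keywords[position + 1:]
            if matching_docs(query, index):
                suggestions.append(query)
    return suggestions
```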

Support for Cuneiform

Tesseract is nice, but Cuneiform appears to be pretty good as well (the output is much cleaner). Having Cuneiform support would be nice.

However, it raises a question: must that appear in the GUI? Average users probably don't care which system they use as long as it works fine.
IMHO, it would be better if Paperwork autodetected the available OCR systems and used a preference list (for instance: 1) Cuneiform if available, 2) Tesseract if available, 3) complain to the user). It would keep the GUI simpler.
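
The preference list could be sketched as follows, assuming the OCR tools are installed as executables named `cuneiform` and `tesseract` on the PATH. The function name is hypothetical; `shutil.which` is the standard-library lookup.

```python
import shutil

# Preference order suggested above: Cuneiform first, then Tesseract.
OCR_PREFERENCES = ("cuneiform", "tesseract")


def pick_ocr_tool(preferences=OCR_PREFERENCES, which=shutil.which):
    """Return (name, path) of the first available OCR tool, or None."""
    for name in preferences:
        path = which(name)
        if path:
            return name, path
    return None  # the caller should then complain to the user
```

The `which` parameter only exists so the lookup can be replaced in tests.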

GUI improvement

The 3 tabs are not really ergonomic.

A nicer layout would be:

On the left, the document list.
- On its top, a toolbar (the same as the current toolbar, but without the exit button)
On the right, the page view.
- On its top, another toolbar, but with a button "previous page", a text field for the page number, and a "next page" button
- On its bottom, the label list (with checkboxes)

Warn when no scanner is found

When Paperwork starts, it looks for a scanner and disables the "scan" buttons if none is found. In that case, it would be nice to display a popup with a "retry" button.

Labels with accents are not properly supported

Labels with accents in their name cannot be selected or edited. To fix.

Zip the documents

It could be useful to actually zip each document:

It would bring Paperwork closer to the way OpenDocument files work
It would make document transfer easier
It would reduce the stress on the filesystem
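
As a sketch of what zipping a document could look like, assuming a document is a flat directory of files such as paper.1.jpg and paper.1.txt. The function name and the choice to put the archive next to the directory are assumptions.

```python
import os
import zipfile


def zip_document(doc_dir, zip_path=None):
    """Pack every file of a document directory into one zip archive."""
    zip_path = zip_path or doc_dir.rstrip(os.sep) + ".zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in sorted(os.listdir(doc_dir)):
            full_path = os.path.join(doc_dir, name)
            if os.path.isfile(full_path):
                # Store files at the archive root, without the directory path.
                archive.write(full_path, arcname=name)
    return zip_path
```

JPEG pages are already compressed, so the main gain would be one file per document rather than a smaller size.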

Language names are in English

In the settings window, the language names are always in English. Translating at least the most common ones would be nice.
(beware of UTF-8 issues)

Add a 'cancel' option

(based on the Gnome recommendations)

A 'cancel' option could be quite useful for some operations. Mostly those regarding deletion of pages or documents (using the recycling bin could be a good thing too).

Not all scan buttons are insensitive when there is no scanner

When there is no scanner available, the button "scan next page" in the tabs is still sensitive.

Multi-scan: Add pages to the current document

In the multi-scan dialog, it would be nice to be able to add pages to the current document as well.

Calibration: mouse cursor should change

In the calibration frame, the mouse cursor should change depending on whether it is over the preview or over a grip.

Display the document ids in a more human-readable way

Document IDs are basically the date and time at which the documents were scanned. Instead of displaying them as-is (YYYYMMDD_HHmm_SS), they could be displayed in a nicer, localized way: something like "Thu. 21st September 2011", for instance.
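
A small sketch of the conversion, assuming IDs always follow the YYYYMMDD_HHmm_SS pattern (for example 20120104_2143_30). The function names are hypothetical.

```python
from datetime import datetime

DOC_ID_FORMAT = "%Y%m%d_%H%M_%S"  # YYYYMMDD_HHmm_SS


def parse_doc_id(doc_id):
    """Turn a document ID into a datetime (raises ValueError if malformed)."""
    return datetime.strptime(doc_id, DOC_ID_FORMAT)


def display_name(doc_id):
    """Return the scan date formatted according to the current locale."""
    return parse_doc_id(doc_id).strftime("%x")
```

`%x` follows the LC_TIME locale, so the application would need to call `locale.setlocale(locale.LC_TIME, "")` at startup to get the user's date format. IDs that do not match the pattern would have to be displayed as-is.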

Scan progression

It would be better if the scan progress could be reported to the user. However, it seems that this will require sending a patch to whoever is responsible for python-imaging-sane.

Parallelize Tesseract

When a page is scanned, 2 calls to tesseract are made (to figure out the orientation of the page). On multi-CPU computers, these calls could be done in parallel.
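
A hedged sketch of running the calls concurrently. `run_ocr(image, angle)` is a hypothetical callable standing for "launch tesseract on the image rotated by this angle and return its text"; threads are enough here because each call would mostly wait on an external process. The scoring rule (keep the orientation yielding the most words) is an assumption, not necessarily the rule Paperwork uses.

```python
from concurrent.futures import ThreadPoolExecutor


def ocr_all_orientations(image, angles, run_ocr):
    """Call run_ocr(image, angle) for every angle, all at the same time.

    Returns a dict mapping each angle to the text returned for it.
    """
    with ThreadPoolExecutor(max_workers=len(angles)) as pool:
        futures = {angle: pool.submit(run_ocr, image, angle) for angle in angles}
        return {angle: future.result() for angle, future in futures.items()}


def best_orientation(results):
    """Pick the angle whose OCR output contains the most words (heuristic)."""
    return max(results, key=lambda angle: len(results[angle].split()))
```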

Document list is not updated on document creation

When a document is created, the (internal) document lists and the (GUI) list of matching documents should be updated

Add an option to reindex all docs

Add an option in the advanced menu to reindex documents + labels.

Add an option to delete single pages

Currently, it's only possible to remove a whole document. Removing a single page can prove useful in case of bad scans
(page scanned twice, or with a poor orientation, etc)

Bring scroll bar up after every page change

When you display a new page, the scroll bar remains where it was on the previous page. Most of the time, the user wants to start reading from the beginning of the page. So the scroll bar should go back up when the user changes the current page.

add scanner settings

It would be better to keep only the useful part of each scan (in other words, the A4 sheet only, and not what's around it).
To do so, a scanner calibration window should be added to the settings.

Weight documents proportionally to the font size of their keywords

The basic idea is the following: when you look for documents, you usually think first of the keywords in the title(s) of the target document(s). Usually, these words are written larger than the others.

--> When documents are found, it would be nice to give them scores based on the size of the font used to write these keywords. The search results could then be ordered using this score.
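
One way to approximate this, sketched under the assumption that each page provides word boxes as (word, (x1, y1, x2, y2)) tuples and that the box height is a usable stand-in for the font size. Both function names are hypothetical.

```python
def font_size_score(keywords, boxes):
    """Sum the heights of the word boxes that match one of the keywords."""
    wanted = {keyword.lower() for keyword in keywords}
    return sum(
        y2 - y1
        for word, (x1, y1, x2, y2) in boxes
        if word.lower() in wanted
    )


def rank_documents(keywords, docs):
    """Order document IDs by score, highest first.

    `docs` maps each document ID to its list of word boxes.
    """
    return sorted(
        docs,
        key=lambda doc_id: font_size_score(keywords, docs[doc_id]),
        reverse=True,
    )
```

Summing the heights also rewards documents where the keyword appears often; using the largest height instead would score only the most prominent occurrence.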

Search: Allow negations

When searching, it would sometimes be useful to reject documents containing specific keywords.
For instance "accident" yields all my earnings statements when I'm actually looking for documents related to my last bike crash.
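
A minimal sketch of such a search over a toy index (a dict mapping each document ID to its set of words). The `-keyword` syntax for negation is an assumption chosen for illustration; the issue does not specify one.

```python
def parse_query(query):
    """Split a query into required keywords and rejected ('-' prefixed) ones."""
    required, rejected = [], []
    for token in query.lower().split():
        if token.startswith("-") and len(token) > 1:
            rejected.append(token[1:])
        else:
            required.append(token)
    return required, rejected


def search(query, index):
    """Return the sorted IDs of documents that contain every required keyword
    and none of the rejected ones."""
    required, rejected = parse_query(query)
    return sorted(
        doc_id
        for doc_id, words in index.items()
        if all(word in words for word in required)
        and not any(word in words for word in rejected)
    )
```

With this syntax, a query such as `accident -salary` would drop the earnings statements from the example above, provided they contain the word "salary".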

Warn when tesseract is not available

Currently, if Tesseract is not installed, Paperwork starts as usual. It would be nice to have a popup warning about this problem. Scanning and OCR should also be disabled.

==> if not sane or not tesseract --> popup + disable scan/ocr

Calibration: When scanning, a busy mouse cursor should be displayed

Weird behavior with delete options

When deleting a document, the search result list is not updated.
When deleting a page, there seem to be too many refreshes.
Also, the busy cursor doesn't appear half of the time.
Also, a label seems to be missing beside the progress bar.

Multi-scan: Add label selection before scan

When scanning multiple documents, it would be handy to be able to specify which labels must be put on the documents before even scanning them.
For instance, someone starting with Paperwork may want to scan all their earning statements in one shot and have the same label put on all of them.

Multi-scan: Show the number of pages already scanned

In the multi-scan dialog, one piece of feedback is missing: the number of pages already scanned.

Indexation progression

When (re)indexing the documents, the progress bar remains at 100% much longer than it should.

Inotify support

Once #20 is implemented, inotify support would be nice. This way, if the user or another program modifies a file, the user won't have to restart Paperwork.

Add an option to redo OCR

On my old documents, I don't have the .box files, so word highlighting is not done. Redoing the OCR could be nice.
Sometimes it seems Tesseract screws up. It would be nice to be able to fix that.
New versions of Tesseract can come out, in which case it may be useful to redo OCR on all the documents (for instance, most of mine were read by Tesseract v2).

Errors at start

When I run Paperwork, there are some error messages on the console. It seems that Paperwork tries to open a nonexistent document, 20120104_2143_30, which corresponds to the current date.

$ paperwork
No handlers could be found for logger "pycountry.db"
Looking for locales in 'locale/fr/LC_MESSAGES/paperwork.mo' ...
Will use locales from 'locale'
Config file found: /home/chris/.paperwork.conf
Try to used UI file ./mainwindow.glade but failed: Failed to open file "./mainwindow.glade": No such file or directory
UI file used: src/mainwindow.glade
Main window resized
Exception while trying to get the number of pages of '20120104_2143_30': [Errno 2] No such file or directory: '/home/chris/papers/20120104_2143_30'
Showing first page of the doc
Showing page '20120104_2143_30 p1'
Unable to show image for '20120104_2143_30 p1': [Errno 2] No such file or directory: '/home/chris/papers/20120104_2143_30/paper.1.jpg'
Unable to read [/home/chris/papers/20120104_2143_30/paper.1.txt]: [Errno 2] No such file or directory: '/home/chris/papers/20120104_2143_30/paper.1.txt'

Printing: Printed pages are blurry

Probably due to the conversion / resize process they go through.

Detect the languages available

There are a lot of Tesseract languages available. Instead of using a predefined list, the list should be generated based on the trained data files available.
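
A sketch of generating that list, assuming Tesseract v3-style data files named `<language code>.traineddata` in a single tessdata directory. The location of that directory varies between installations, so it is left to the caller; the function name is hypothetical.

```python
import os


def available_languages(tessdata_dir):
    """Return the sorted language codes that have a trained data file."""
    suffix = ".traineddata"
    return sorted(
        name[:-len(suffix)]
        for name in os.listdir(tessdata_dir)
        if name.endswith(suffix)
    )
```

The resulting codes (for example `eng` or `fra`) would still have to be mapped to display names in the settings window.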

scanner: lazy initialization

Paperwork should only look for a scanner when the user tries to scan something.
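
Lazy initialization could be sketched like this. `find_scanner` is a hypothetical callable that performs the actual (slow) device lookup and returns a device object, or None when no scanner is found; the class name is also an assumption.

```python
class LazyScanner:
    """Look for a scanner only the first time one is actually needed."""

    def __init__(self, find_scanner):
        self._find_scanner = find_scanner
        self._device = None

    @property
    def device(self):
        # The lookup happens on first access, not at application startup.
        if self._device is None:
            self._device = self._find_scanner()
        return self._device
```

Because a failed lookup leaves `_device` at None, the next access simply tries again, which also covers the case where the scanner is plugged in after Paperwork has started.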

Move "Delete" options from the file menu to the tabs

"Delete" options are tab-specific. Each should be a button at the bottom of its tab (like the "add label" button, for instance).

Use Gobject and threads

Indexing, scanning, etc. should be done in separate threads. Using GObject, synchronization with the GTK thread could be handled easily.

"Device is busy"

If you do a calibration scan, it is no longer possible to do a normal scan. If you run Paperwork in a terminal, an exception is raised with the error message "Device is busy".

There are 2 possibilities:

Either I misuse the API of the Python module for Sane
Or the function device.close() of this module is buggy

The user clicks "new document"
When a new page has just been scanned

Export PDF

It would be nice to be able to export documents as PDF (to send them by email, for instance).

http://tfischernet.wordpress.com/2008/11/26/searchable-pdfs-with-linux/
http://blog.konradvoelkel.de/2010/01/linux-ocr-and-pdf-problem-solved/

The only issue here would be the quality of the resulting PDF (any display issues in some readers? any weird behavior with Ctrl+F?)