paperwork's People

Contributors

ajira86, arthurlutz, avaiss, bryant1410, claudex, davidbrcz, delaere, jaesivsm, jflesch, jfleschwyplay, jpartain89, jstenback, kernelhacker, kigeia, krap, mailaender, mathieuschopfer, mjourdan, pacien, pingtux, plietar, pscheid92, sbrunner, scubbx, spaetz, swap38, tclavier, teto, textpreferred, tiramiseb

paperwork's Issues

Import PDF

Someone suggested being able to import PDF could be useful.

The idea is the following: instead of focusing only on papers, Paperwork should be able to deal with any kind of bill, letter, whatever. And everybody gets some of these by email, in PDF format. This way, instead of having to do two searches (one with something like beagle, and one with Paperwork), the user could simply do one single search in Paperwork.

--> Is it really useful?
--> Maybe it could be done as a plugin? (which implies that a plugin mechanism must be implemented)

Add tag support

Being able to tag documents would be a nice feature. For instance, tags like "bank", "salary", "bike crash", etc. :)

Having colors on these tags would also be awesome.

Image import function

Add an image import function. --> one image == one page ; one directory == one document.
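A minimal sketch of that mapping, assuming pages are ordered by file name. The `IMAGE_EXTENSIONS` set and the `list_pages` helper are hypothetical names chosen for illustration, not part of Paperwork:

```python
from pathlib import Path

# Assumption: which file extensions count as page images.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp"}


def list_pages(directory):
    """Return the image files of `directory`, one per page, sorted by name."""
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )
```

Each returned path would become one page of the document that represents the directory.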

Use the hOCR output of Tesseract

Since v3, Tesseract is able to generate hOCR files. Since v3.01, it is shipped with a configuration file for that (Tesseract Issue 377).
Using this format would avoid having to assemble the .txt and .box files each time the user wants to see a page.
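As a rough illustration of why hOCR is convenient, the sketch below pulls each word and its bounding box out of an hOCR string using only the standard library. It assumes words are marked with `class='ocrx_word'` and carry a `bbox x0 y0 x1 y1` entry in their `title` attribute; the exact markup may differ between Tesseract versions, and `WordBoxParser` / `parse_hocr_words` are hypothetical names:

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordBoxParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for each word span of an hOCR file."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bounding box of the word span currently open
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumption: word spans do not contain nested <span> elements.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def parse_hocr_words(hocr):
    parser = WordBoxParser()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

The text and the boxes come from a single file, so nothing has to be reassembled when a page is displayed.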

Support for scanner feeder

Currently, only page-by-page scanning is supported. Support for scanner feeders should be added as well (see simple-scan for instance).

Add an option to delete / edit labels

Labels are currently only deleted when no document uses them anymore.
It would be nicer if a right-click menu allowed removing a label from all documents or editing it (changing its name or its color, for instance).

Support for Cuneiform

Tesseract is nice, but Cuneiform appears to be pretty good as well (the output is much cleaner). Having Cuneiform support would be nice.

However, it raises a question: must that appear in the GUI? Average users probably don't care which system is used as long as it works fine.
I think it would be better if Paperwork autodetected the OCR systems and used a preference list (for instance 1) Cuneiform if available, 2) Tesseract if available, 3) complain to the user). It would keep the GUI simpler.
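The preference list could be sketched as follows. The executable names `cuneiform` and `tesseract` are assumed to be what is installed on the `PATH`, and `pick_ocr_tool` is a hypothetical helper:

```python
import shutil

# Preference order taken from the issue text; executable names are assumptions.
OCR_PREFERENCE = ("cuneiform", "tesseract")


def pick_ocr_tool(preference=OCR_PREFERENCE, which=shutil.which):
    """Return the first OCR executable found on PATH, or None if there is none."""
    for name in preference:
        if which(name) is not None:
            return name
    return None
```

A `None` result is the "complain to the user" case. The `which` parameter only exists so the lookup can be replaced in tests.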

GUI improvement

The 3 tabs are not really ergonomic.

A nicer layout would be:

  • On the left, the document list.
    • On its top, a toolbar (the same as the current toolbar, but without the exit button)
  • On the right, the page view.
    • On its top, another toolbar, but with a button "previous page", a text field for the page number, and a "next page" button
    • On its bottom, the label list (with checkboxes)

Warn when no scanner is found

When Paperwork starts, it looks for a scanner, and disables the "scan" buttons if none is found. In that case, it would be nice to display a popup with a "retry" button.

Zip the documents

It could be useful to actually zip each document:

  • It would bring Paperwork closer to the way OpenDocument files work
  • It would make transferring documents easier
  • It would reduce the stress on the filesystem

Language names are in English

In the settings window, the languages are always in English. Translating at least the most common would be nice.
(beware of UTF-8 issues)

Add a 'cancel' option

(based on the Gnome recommendations)

A 'cancel' option could be quite useful for some operations. Mostly those regarding deletion of pages or documents (using the recycling bin could be a good thing too).

Display the document ids in a more human-readable way

Document ids are basically the date and time at which they were scanned. Instead of displaying them as-is (YYYYMMDD_HHmm_SS), they could be displayed in a nicer, localized way: something like "Wed. 21st September 2011" for instance.
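A minimal sketch of the conversion, assuming ids always follow the YYYYMMDD_HHmm_SS pattern. The helper names and the output format string are illustrative choices; `strftime` uses whatever `LC_TIME` locale the application has set, so day and month names are only localized if `locale.setlocale` was called beforehand:

```python
from datetime import datetime

DOC_ID_FORMAT = "%Y%m%d_%H%M_%S"


def parse_doc_id(doc_id):
    """Turn an id such as '20120104_2143_30' into a datetime."""
    return datetime.strptime(doc_id, DOC_ID_FORMAT)


def format_doc_id(doc_id):
    """Return a human-readable label for a document id."""
    # Assumption: weekday, day, month, year and time is the wanted layout.
    return parse_doc_id(doc_id).strftime("%A %d %B %Y, %H:%M")
```

Ids that do not match the pattern make `strptime` raise `ValueError`, so a caller could fall back to showing the raw id.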

Scan progress

It would be better if the scan progress could be reported to the user. However, it seems that this will require sending a patch to whoever is responsible for python-imaging-sane.

Parallelize Tesseract

When a page is scanned, 2 calls to tesseract are made (to figure out the orientation of the page). On multi-CPU computers, these calls could be done in parallel.
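One way to sketch this is with a thread pool: each thread would mostly wait on an external tesseract process, so threads are enough to keep two CPUs busy. Here `ocr` is a hypothetical callable that runs the OCR on an image for a given rotation, and the default angles are an assumption, since the issue does not say which orientations are tried:

```python
from concurrent.futures import ThreadPoolExecutor


def ocr_all_orientations(ocr, image, orientations=(0, 180)):
    """Run `ocr(image, angle)` for every angle at the same time.

    Returns a dict mapping each angle to the result of its OCR call.
    """
    with ThreadPoolExecutor(max_workers=len(orientations)) as pool:
        futures = {
            angle: pool.submit(ocr, image, angle) for angle in orientations
        }
        # result() blocks until the call is done and re-raises its exception.
        return {angle: future.result() for angle, future in futures.items()}
```

The caller can then compare the results and keep the orientation that produced the most plausible text.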

Add an option to delete single pages

Currently it's only possible to remove a whole document. Removing a single page can prove useful in case of crappy scans
(page scanned twice, or with a poor orientation, etc)

Bring scroll bar up after every page change

When you display a new page, the scroll bar remains where it was on the previous page. Most of the time, the user wants to start reading from the beginning of the page. So the scroll bar should go back up when the user changes the current page.

Add scanner settings

It would be better to only keep the useful part of each scan (in other words, the A4 sheet only, and not what's around it).
To do so, a scanner calibration window should be added to the settings.

Weight documents proportionally to the font size of their keywords

The basic idea is the following: when you look for documents, you usually think first of the keywords in the title(s) of the target document(s). Usually, these words are written bigger than the others.

--> When documents are found, it would be nice to give them scores based on the size of the font used to write these keywords. The search results could then be ordered using this score.
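A possible scoring function is sketched below. It assumes each document is available as a list of `(text, (left, top, right, bottom))` word boxes and uses the box height as a stand-in for the font size. Summing the heights of all matching words is just one choice among several (taking the maximum would be another), and `score_document` is a hypothetical name:

```python
def score_document(word_boxes, keywords):
    """Sum the box heights of every word that matches one of the keywords."""
    wanted = {keyword.lower() for keyword in keywords}
    return sum(
        bottom - top
        for text, (left, top, right, bottom) in word_boxes
        if text.lower() in wanted
    )


def rank_documents(documents, keywords):
    """Sort (doc_id, word_boxes) pairs from best to worst score."""
    return sorted(
        documents,
        key=lambda doc: score_document(doc[1], keywords),
        reverse=True,
    )
```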

Search: Allow negations

When searching, it would sometimes be useful to reject documents containing specific keywords.
For instance "accident" yields all my earnings statements when I'm actually looking for documents related to my last bike crash.
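A small sketch of what such a query could look like, assuming a leading `-` marks a keyword to reject (the syntax is an assumption, borrowed from common search engines) and that each document is reduced to a set of words:

```python
def parse_query(query):
    """Split a query into (wanted, rejected) keyword lists."""
    wanted, rejected = [], []
    for token in query.lower().split():
        if token.startswith("-") and len(token) > 1:
            rejected.append(token[1:])
        else:
            wanted.append(token)
    return wanted, rejected


def matches(doc_words, query):
    """True if the document has every wanted word and no rejected word."""
    wanted, rejected = parse_query(query)
    words = {word.lower() for word in doc_words}
    return all(w in words for w in wanted) and not any(
        w in words for w in rejected
    )
```

With this, a query such as `accident -salary` would keep the bike crash report and drop the earnings statements.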

Warn when tesseract is not available

Currently, if Tesseract is not installed, Paperwork starts as usual. It would be nice to have a popup warning about this problem. Scanning and OCR should also be disabled.

==> if not sane or not tesseract --> popup + disable scan/ocr

Weird behavior with delete options

When deleting a document, the search result list is not updated.
When deleting a page, there seem to be too many refreshes.
Also the busy cursor doesn't appear half of the time.
Also a label seems to be missing beside the progress bar.

Multi-scan: Add label selection before scan

When scanning multiple documents, it would be handy to be able to specify which labels must be put on the documents before even scanning them.
For instance, someone starting with Paperwork may want to scan all their earnings statements in one shot and have the same label put on all of them.

Indexing progress

When (re)indexing the documents, the progress bar remains at 100% much longer than it should.

Inotify support

Once #20 is implemented, support for inotify would be nice. This way, if the user or another program modifies a file, the user won't have to restart Paperwork.

Add an option to redo OCR

  1. On my old documents, I don't have the .box files, so word highlighting is not done. Redoing the OCR could be nice
  2. Sometimes it seems Tesseract screws up. It would be nice to be able to fix that
  3. New versions of Tesseract can come out, in which case it may be useful to redo OCR on all the documents (for instance most of mine were read by Tesseract v2)

Errors at start

When I run Paperwork, there are some error messages on the console. It seems that Paperwork tries to open a nonexistent document, 20120104_2143_30, whose name corresponds to the current date.

$ paperwork
No handlers could be found for logger "pycountry.db"
Looking for locales in 'locale/fr/LC_MESSAGES/paperwork.mo' ...
Will use locales from 'locale'
Config file found: /home/chris/.paperwork.conf
Try to used UI file ./mainwindow.glade but failed: L'ouverture du fichier « ./mainwindow.glade » a échoué : Aucun fichier ou dossier de ce type
UI file used: src/mainwindow.glade
Main window resized
Exception while trying to get the number of pages of '20120104_2143_30': [Errno 2] Aucun fichier ou dossier de ce type: '/home/chris/papers/20120104_2143_30'
Showing first page of the doc
Showing page '20120104_2143_30 p1'
Unable to show image for '20120104_2143_30 p1': [Errno 2] Aucun fichier ou dossier de ce type: '/home/chris/papers/20120104_2143_30/paper.1.jpg'
Unable to read [/home/chris/papers/20120104_2143_30/paper.1.txt]: [Errno 2] Aucun fichier ou dossier de ce type: '/home/chris/papers/20120104_2143_30/paper.1.txt'

Detect the languages available

There are a lot of Tesseract languages available. Instead of using a predefined list, it should be generated based on the trained data files available.
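Since Tesseract 3 stores each language as a `<code>.traineddata` file, the list could be built by scanning the tessdata directory. The location of that directory varies between installations, so it is left as a parameter here, and `available_languages` is a hypothetical helper:

```python
from pathlib import Path


def available_languages(tessdata_dir):
    """Return the language codes that have a trained data file, sorted."""
    return sorted(
        path.stem for path in Path(tessdata_dir).glob("*.traineddata")
    )
```

The returned codes (for instance `eng` or `fra`) could then be mapped to display names for the settings window.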

Use Gobject and threads

Indexing, scanning, etc. should be done in separate threads. Using GObject, synchronization with the GTK thread could be easily handled.

"Device is busy"

If you do a calibration scan, it is no longer possible to do a normal scan. If you run Paperwork in a terminal, an exception is raised with the error message "Device is busy".

There are 2 possibilities:

  • Either I misuse the API of the Python module for Sane
  • Or the function device.close() of this module is buggy

Use GIO (Gnome IO)

Using gnome vfs could make things easier on some points: papers could be stored on anything accessible through GIO (samba, ftp, ssh, etc).
Also, it could make the use of the Trash folder easier (see issue 19).

Store the keyword index in a Sqlite database

At startup, Paperwork reindexes all the documents. It could be a good idea to store keywords in a sqlite3 database. This way, at startup, Paperwork would only have to check the modification timestamps of the files.
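A minimal sketch of that idea with the standard `sqlite3` module. The table layout and the function names are assumptions made for illustration; a real index would probably store more than bare keywords:

```python
import os
import sqlite3


def open_index(db_path):
    """Open (and create if needed) the keyword index database."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files "
        "(path TEXT PRIMARY KEY, mtime REAL NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS keywords "
        "(path TEXT NOT NULL, word TEXT NOT NULL)"
    )
    return conn


def needs_reindex(conn, path):
    """True if the file is unknown or was modified since it was indexed."""
    row = conn.execute(
        "SELECT mtime FROM files WHERE path = ?", (path,)
    ).fetchone()
    return row is None or row[0] != os.path.getmtime(path)


def store_keywords(conn, path, words):
    """Replace the keywords of a file and remember its modification time."""
    with conn:  # commit on success, roll back on error
        conn.execute("DELETE FROM keywords WHERE path = ?", (path,))
        conn.executemany(
            "INSERT INTO keywords (path, word) VALUES (?, ?)",
            [(path, word) for word in sorted(set(words))],
        )
        conn.execute(
            "INSERT OR REPLACE INTO files (path, mtime) VALUES (?, ?)",
            (path, os.path.getmtime(path)),
        )
```

At startup, only the files for which `needs_reindex` returns true would have to be read again.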
