paperless's Introduction

Paperless

Important news about the future of this project

It's been more than 5 years since I started this project on a whim as an effort to try to get a handle on the massive amount of paper I was dealing with in relation to various visa applications (expat life is complicated!). Since then, the project has exploded in popularity, so much so that it overwhelmed me and working on it stopped being "fun" and started becoming a serious source of stress.

In an effort to fix this, I created the Paperless GitHub organisation, and brought on a few people to manage the issue and pull request load. Unfortunately, that model has proven to be unworkable too. With 23 pull requests waiting and 157 issues slowly filling up with confused/annoyed people wanting to get their contributions in, my whole "appoint a few strangers and hope they've got time" idea is showing my lack of foresight and organisational skill.

In the shadow of these difficulties, a fork called Paperless-ng written by Jonas Winkler has cropped up. It's really good, and unlike this project, it's actively maintained (at the time of this writing anyway). With 564 forks currently tracked by GitHub, I suspect there are a few more forks worth looking into out there as well.

So, with all of the above in mind, I've decided to archive this project as read-only and suggest that those interested in new updates or submitting patches have a look at Paperless-ng. If you really like "Old Paperless", that's ok too! The project is GPL licensed, so you can fork it and run it on whatever you like so long as you respect the terms of said license.

In time, I may transfer ownership of this organisation to Jonas if he's interested in taking that on, but for the moment, he's happy to run Paperless-ng out of its current repo. Regardless, if we do decide to make the transfer, I'll post a notification here a few months in advance so that people won't be surprised by new code at this location.

For my part, I'm really happy & proud to have been part of this project, and I'm sorry I've been unable to commit more time to it for everyone. I hope you all understand, and I'm really pleased that this work has been able to continue to live and be useful in a new project. Thank you to everyone who contributed, and for making Free software awesome.

Sincerely, Daniel Quinn

Index and archive all of your scanned paper documents

I hate paper. Environmental issues aside, it's a tech person's nightmare:

  • There's no search feature
  • It takes up physical space
  • Backups mean more paper

In the past few months I've been bitten more than a few times by the problem of not having the right document around. Sometimes I recycled a document I needed (who keeps water bills for two years?) and other times I just lost it... because paper. I wrote this to make my life easier.

How it Works

Paperless does not control your scanner; it only helps you deal with what your scanner produces.

  1. Buy a document scanner that can write to a place on your network. If you need some inspiration, have a look at the scanner recommendations page.
  2. Set it up to "scan to FTP" or something similar. It should be able to push scanned images to a server without you having to do anything. Of course if your scanner doesn't know how to automatically upload the file somewhere, you can always do that manually. Paperless doesn't care how the documents get into its local consumption directory.
  3. Have the target server run the Paperless consumption script to OCR the file and index it into a local database.
  4. Use the web frontend to sift through the database and find what you want.
  5. Download the PDF you need/want via the web interface and do whatever you like with it. You can even print it and send it as if it's the original. In most cases, no one will care or notice.
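The consumption step above can be pictured as a polling loop over a directory. The following is only a minimal sketch under stated assumptions, not the project's actual consumer: the function names, the `handle` callback (standing in for OCR and indexing) and the list of suffixes are all hypothetical.

```python
import time
from pathlib import Path


def find_new_documents(consumption_dir, seen, suffixes=(".pdf", ".png", ".jpg")):
    """Return files in consumption_dir that are not in `seen`, oldest first."""
    new_files = []
    for path in Path(consumption_dir).iterdir():
        # Compare suffixes in lower case so SCAN.PDF is picked up too.
        if path.is_file() and path.suffix.lower() in suffixes and path not in seen:
            new_files.append(path)
    new_files.sort(key=lambda p: p.stat().st_mtime)
    return new_files


def watch(consumption_dir, handle, interval=10, max_loops=None):
    """Poll the directory and pass each new file to `handle` exactly once."""
    seen = set()
    loops = 0
    while max_loops is None or loops < max_loops:
        for path in find_new_documents(consumption_dir, seen):
            handle(path)  # hypothetical callback: OCR the file and index it
            seen.add(path)
        loops += 1
        if max_loops is None or loops < max_loops:
            time.sleep(interval)
```

Calling `watch("/path/to/consumption", print)` would print each new file as it appears. How the files get into the directory (FTP, manual copy) makes no difference to the loop.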

Here's what you get:

The before and after

Documentation

It's all available on ReadTheDocs.

Requirements

This is all really a quite simple, shiny, user-friendly wrapper around some very powerful tools.

  • ImageMagick converts the images between colour and greyscale.
  • Tesseract does the character recognition.
  • Unpaper despeckles and deskews the scanned image.
  • GNU Privacy Guard is used as the encryption backend.
  • Python 3 is the language of the project.
    • Pillow loads the image data as a python object to be used with PyOCR.
    • PyOCR is a slick programmatic wrapper around tesseract.
    • Django is the framework this project is written against.
    • Python-GNUPG decrypts the PDFs on-the-fly to allow you to download unencrypted files, leaving the encrypted ones on-disk.

Project Status

This project has been around since 2015, and there's lots of people using it. For some reason, it's really popular in Germany -- maybe someone over there can clue me in as to why?

I am no longer doing new development on Paperless as it does exactly what I need it to, and I have since turned my attention to my latest project, Aletheia. However, I'm not abandoning this project. I am happy to field pull requests and answer questions in the issue queue. If you're a developer yourself and want a new feature, float it in the issue queue and/or send me a pull request! I'm happy to add new stuff, but I just don't have the time to do that work myself.

Affiliated Projects

Paperless has been around a while now, and people are starting to build stuff on top of it. If you're one of those people, we can add your project to this list:

Similar Projects

There's another project out there called Mayan EDMS that has a surprising amount of technical overlap with Paperless. Also based on Django and using a consumer model with Tesseract and Unpaper, Mayan EDMS is much more featureful and comes with a slick UI as well, but still in Python 2. It may be that Paperless consumes fewer resources, but to be honest, this is just a guess as I haven't tested this myself. One thing's for certain though, Paperless is a way better name.

Important Note

Document scanners are typically used to scan sensitive documents. Things like your social insurance number, tax records, invoices, etc. While Paperless encrypts the original files via the consumption script, the OCR'd text is not encrypted and is therefore stored in the clear (it needs to be searchable, so if someone has ideas on how to do that on encrypted data, I'm all ears). This means that Paperless should never be run on an untrusted host. Instead, I recommend that if you do want to use it, run it locally on a server in your own home.

Donations

As with all Free software, the power is less in the finances and more in the collective efforts. I really appreciate every pull request and bug report offered up by Paperless' users, so please keep that stuff coming. If however, you're not one for coding/design/documentation, and would like to contribute financially, I won't say no ;-)

The thing is, I'm doing ok for money, so I would instead ask you to donate to the United Nations High Commissioner for Refugees. They're doing important work and they need the money a lot more than I do.

paperless's People

Contributors

addadi, ahyear, bastianpoe, belonias, bmsleight, ckut, dadosch, danielquinn, ddddavidmartin, diveflo, ekw, elohmeier, erikarvstedt, grembo, jat255, jenspfeifle, languitar, lawtancool, lukaszsolo, maphy-psd, masterofjokers, matthewmoto, ovv, piotrcichosz, pitkley, rhaamo, sbrunner, strubbl, tido-, tikitu

paperless's Issues

Logging in paperless?

Probably related to #16.

It would be great to implement some sort of logging for the webserver and the consumer, since right now everything seems to be output to stdout. Perhaps it's just me, but I'm not sure how to capture this for later analysis. I think having some sort of log would also help with a status indicator.

Document Categories

Thanks for putting together this tool! After a bit of tooling around, I've got a bare metal install to play with. I was wondering what your thoughts are on creating a new property for Documents for a Category, separate from the tags. That way if you want to separate Documents into "Mail", "Receipts", "Statements", you can do that and perhaps even have different behaviors per Category.

An example of something that I personally would like to see is that for all of my Mail that I scan, I scan the envelope first, then open it up and scan the contents as subsequent pages. It would be cool for the "Mail" category to show a thumbnail of the first page of the PDF (in my case it would be the envelope). For the other Categories, maybe allow the user to upload their own images or whatnot to show for them so that at a glance, the Documents can be quickly gleaned as to which Category they belong to (a briefcase or a lock or whatnot).

If you're not opposed to the idea, I could potentially help implement it, time permitting.

Not consuming on Vagrant

I'm testing this locally on Linux, and can't seem to get consumption to work. Django runs fine, consumer lists no errors, I have a PDF in the consumption dir.

Nothing.

Any gotchas? File size, type? Anything else?

Image upload API ?

Should we have an API for uploading an image of a bill or receipt? Not everyone has a document scanner, so I think uploading images this way would be really handy.

Label/Tag

I think it would be a good idea to add some kind of label or tags.
That way one could give the electric bill, for example: BILL, ELECTRIC
Or if it is for your mother's computer: BILL, COMPUTER, MOM, ...
Some bills might be too long for a single scan and that way it is possible to group those a little at least.

With a little machine learning that could be done automatically I suppose.

PARSER_REGEX and PARSER_REGEX_TITLE should be case-insensitive.

It took me a little while to figure out why the consumer was ignoring my PDFs. I moved them directly from my Brother DSmobile 920-DW scanner, which by default writes the file extension as .PDF, not .pdf.

Thus they fail this check in the document_consumer script:

        if not re.match(self.PARSER_REGEX_TITLE, pdf):
            continue
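One way to make such a check case-insensitive is to compile the pattern with `re.IGNORECASE`. This is a sketch only: the pattern below is a hypothetical stand-in, since the project's real `PARSER_REGEX_TITLE` is not shown here, and `is_consumable` is an illustrative helper name.

```python
import re

# Hypothetical stand-in for PARSER_REGEX_TITLE; the real pattern may differ.
PARSER_REGEX_TITLE = re.compile(r"^(?P<title>.+)\.pdf$", flags=re.IGNORECASE)


def is_consumable(filename):
    """True when the file name ends in .pdf, regardless of letter case."""
    return PARSER_REGEX_TITLE.match(filename) is not None
```

With the flag set, `scan.PDF` and `scan.pdf` are treated alike, while `scan.txt` is still skipped.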

Introduce a means of detecting & changing OCR language

PR #7 starts paperless down the path of supporting multiple languages, but I'd like to get it to the point where we can scan documents of different languages and have it automatically choose the appropriate language parser for OCR. I have no idea how to do this of course, so ideas/suggestions are welcome.

Feature request: Status indication in web gui

It would be great if there were some sort of indication in the GUI as to what was happening at any given time, i.e. "Idle", "Consuming name of file X of N pages completed", etc.

Unfortunately I have no experience with django, so this is really just a request, as I don't think I'd be able to implement something like this myself.

FileNotFoundError: [Errno 2] No such file or directory: '/usr/bin/convert'

Hi, when I run ./manage.py document_consumer this comes out (OS X El Capitan):
Traceback (most recent call last):
  File "./manage.py", line 18, in <module>
    execute_from_command_line(sys.argv)
  File "/Users/Francesco/Library/Python/3.5/lib/python/site-packages/django/core/management/__init__.py", line 350, in execute_from_command_line
    utility.execute()
  File "/Users/Francesco/Library/Python/3.5/lib/python/site-packages/django/core/management/__init__.py", line 342, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/Users/Francesco/Library/Python/3.5/lib/python/site-packages/django/core/management/base.py", line 348, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/Users/Francesco/Library/Python/3.5/lib/python/site-packages/django/core/management/base.py", line 399, in execute
    output = self.handle(*args, **options)
  File "/Users/Francesco/Desktop/paperless-master/src/documents/management/commands/document_consumer.py", line 49, in handle
    self.loop()
  File "/Users/Francesco/Desktop/paperless-master/src/documents/management/commands/document_consumer.py", line 59, in loop
    self.file_consumer.consume()
  File "/Users/Francesco/Desktop/paperless-master/src/documents/consumer.py", line 116, in consume
    pngs = self._get_greyscale(tempdir, doc)
  File "/Users/Francesco/Desktop/paperless-master/src/documents/consumer.py", line 139, in _get_greyscale
    "-type", "grayscale", doc, png
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/subprocess.py", line 950, in __init__
    restore_signals, start_new_session)
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/subprocess.py", line 1544, in _execute_child
    raise child_exception_type(errno_num, err_msg)
FileNotFoundError: [Errno 2] No such file or directory: '/usr/bin/convert'

I've installed pip3 requirements from requirements.txt.
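The traceback shows a hard-coded `/usr/bin/convert` that does not exist on this machine. A possible way to avoid that, sketched here with the standard library's `shutil.which`, is to fall back to a search of `PATH` when the configured location is missing. The helper name `locate_binary` and its parameters are assumptions for illustration, not part of Paperless.

```python
import shutil


def locate_binary(name, configured_path=None):
    """Return a usable path for `name`, or None if it cannot be found.

    `configured_path` is tried first (for example a value from settings.py);
    if it is missing or not executable, PATH is searched for `name`.
    """
    if configured_path and shutil.which(configured_path):
        return configured_path
    return shutil.which(name)
```

For example, `locate_binary("convert", "/usr/bin/convert")` would return whatever `convert` is on `PATH` (such as a Homebrew install) when `/usr/bin/convert` is absent, and `None` if ImageMagick is not installed at all.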

Server setup via Docker

It would be really cool to have a small Docker script that allows this project to be built and deployed behind an Apache or nginx server; that way a person could simply drop it onto a cheap host somewhere in the cloud and have the service accessible to them from everywhere...

case of document.contents for document_retagger, etc.

Hi Daniel!

It appears that the document contents are stored in mixed case but all tags are stored and matched in lower case. Shouldn't the contents be stored as lower case?

I love this project!!! I'm moving all my PDF files into it. At the moment I only have about 1500 in there and things seem to be running okay. I wonder how many will make sqlite start to crack.

Integrate dotenv

I discovered dotenv today and it's doing exactly what I was trying to figure out how to do elegantly on my own. No sense in reinventing the wheel.

Data loss when exporting documents with duplicate titles

The export script writes documents into a directory without checking for name collisions. This means that if I upload two documents with the same title, they will produce distinct entries in the database, but the export will clobber one of them.

I'm happy to tackle fixing this (I have quite some more experience with python and django than with docker :-) ), but wanted to get your pre-approval for the approach I would take. I thought to add a timestamp prefix to the export, so that files come out looking like 2016-02-13T23:15:02Z - Title - tag,tag.pdf. It's not 100% watertight, but it seems pretty much good enough to me. It also implies the consumer has to recognise this format -- that has one nice side effect, namely that exporting docs and re-importing them will also preserve the timestamps at which they were originally imported.

If you prefer a different approach, I'm happy to take it on if you can give me a sketch, or to brainstorm alternative ideas here. I actually need something like this, because one of my use cases for paperless is storing documents that come in at regular intervals and will always get the same name (e.g. the electricity bill).

  • Are filenames with .isoformat() datetimes scp-compatible? (There is some concern about the : character.)
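To make the proposal concrete, here is a rough sketch of building and parsing such export names. It sidesteps the `:` concern by using the ISO 8601 basic format (`20160213T231502Z`) instead of `.isoformat()`; that substitution, along with the function names, is an assumption of this sketch rather than anything decided in the issue.

```python
from datetime import datetime, timezone

STAMP_FORMAT = "%Y%m%dT%H%M%SZ"  # ISO 8601 basic format: no ":" in the name


def export_name(created, title, tags, extension="pdf"):
    """Build 'STAMP - Title - tag,tag.ext' from a timezone-aware datetime."""
    stamp = created.astimezone(timezone.utc).strftime(STAMP_FORMAT)
    return "{} - {} - {}.{}".format(stamp, title, ",".join(tags), extension)


def parse_export_name(filename):
    """Invert export_name, so a re-import can keep the original timestamp."""
    stem, extension = filename.rsplit(".", 1)
    stamp, rest = stem.split(" - ", 1)
    # Split from the right so that a title containing " - " survives.
    title, tags = rest.rsplit(" - ", 1)
    created = datetime.strptime(stamp, STAMP_FORMAT).replace(tzinfo=timezone.utc)
    return created, title, (tags.split(",") if tags else []), extension
```

As noted above, this is not watertight: two documents with the same title imported within the same second would still collide.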

Consumption directory /Users/me/PAPERLESS_CONSUMPTION does not exist

Hi!
I've got this setup using Vagrant. The web server runs, the settings I've set in settings.py for user and password work. BUT, when I run /opt/paperless/src/manage.py document_consumer I get a warning that it can't find my PAPERLESS_CONSUMPTION_DIR. It is in my Home folder, I've set it to be read/writeable by all, etc. What am I doing wrong?? Thank you, this seems like awesome software.

Deskew/Despeckle

Often times when scanning documents (especially in bulk) you run into situations where the document is slightly skewed and/or has speckles that throw off OCR.

skew
speckle

Make Logging Suck Less

The logging facility I introduced in #16 is crude and doesn't make use of Python's native logging. I did this because I find Python's native logging facility confusing, but I hear that it can do what I want.

So, here's what I think logging should do:

  • You should be able to call it at arbitrary times, giving each message a level like debug, error, warning, etc.
  • You should be able to watch the logging output in a stream, like tail -f /path/to/file
  • You should be able to do REST API calls to get something like "the last 5 messages of level >=info".

The problem for me is that I can't figure out how to solve logging and still allow for the second and third point above. Any advice on this is greatly appreciated.
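For what it's worth, Python's native logging can probably cover all three points by attaching two handlers to one logger: a `FileHandler` (which can be followed with `tail -f`) and a small custom handler that keeps recent records in memory for an API view to read. The sketch below is one possible arrangement; `RecentMessagesHandler`, `build_logger` and the logger name are hypothetical, and persisting records in the database instead would be an equally valid choice.

```python
import collections
import logging


class RecentMessagesHandler(logging.Handler):
    """Keep the most recent records in memory so an API view could return them."""

    def __init__(self, capacity=100):
        super().__init__()
        self.records = collections.deque(maxlen=capacity)

    def emit(self, record):
        self.records.append(record)

    def last(self, count, min_level=logging.INFO):
        """Return up to `count` formatted messages of level >= min_level."""
        if count <= 0:
            return []
        matching = [r for r in self.records if r.levelno >= min_level]
        return [self.format(r) for r in matching[-count:]]


def build_logger(logfile, name="paperless.sketch"):
    """Create a logger that writes to `logfile` and to an in-memory buffer.

    Call once per logger name; calling it again would add duplicate handlers.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")

    file_handler = logging.FileHandler(logfile)  # can be watched with tail -f
    file_handler.setFormatter(formatter)
    recent = RecentMessagesHandler()
    recent.setFormatter(formatter)

    logger.addHandler(file_handler)
    logger.addHandler(recent)
    return logger, recent
```

After `logger, recent = build_logger("/tmp/paperless.log")`, calls such as `logger.info("...")` land in the file, and `recent.last(5)` returns "the last 5 messages of level >= info" for a REST endpoint to serialise.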

Tag matching: use a non-space delimiter so tags can match strings containing spaces

eg: "credit card"
if I search for credit and card separately I get lots of matches that aren't relevant.
Similarly if I search for a company name with common words, eg: "Direct Line".

I had a look at regex expressions to match a string with spaces, for example the two above, in order. I couldn't find any obvious examples, and it'd seem a bit overkill. A comma , or pipe | might be a better separator.

Also related, I could only see the help text and the fact space delimitation was used by looking at the code.

Elasticsearch/Lucene/Solr

While Postgres does have fantastic fulltext search, might I suggest using a purpose built search engine?

Just some feedback

Hi @danielquinn, thank you for paperless, I like it 😄

Here is a bit of feedback !

About usage

Would it be possible to have a single command to launch both the admin interface and the document consumer at once? I wrote a shell script to do this but if I send one of the two in background then I can't manage the processes properly.

Some issues

  • A langdetect exception:

Traceback (most recent call last):
  File "./manage.py", line 18, in <module>
    execute_from_command_line(sys.argv)
  File "django/core/management/__init__.py", line 350, in execute_from_command_line
    utility.execute()
  File "django/core/management/__init__.py", line 342, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "django/core/management/base.py", line 348, in run_from_argv
    self.execute(*args, **cmd_options)
  File "django/core/management/base.py", line 399, in execute
    output = self.handle(*args, **options)
  File "paperless/src/documents/management/commands/document_consumer.py", line 68, in handle
    self.loop()
  File "paperless/src/documents/management/commands/document_consumer.py", line 98, in loop
    text = self._get_ocr(pngs)
  File "paperless/src/documents/management/commands/document_consumer.py", line 161, in _get_ocr
    guessed_language = langdetect.detect(raw_text)
  File "langdetect/detector_factory.py", line 130, in detect
    return detector.detect()
  File "langdetect/detector.py", line 135, in detect
    probabilities = self.get_probabilities()
  File "langdetect/detector.py", line 142, in get_probabilities
    self._detect_block()
  File "langdetect/detector.py", line 149, in _detect_block
    raise LangDetectException(ErrorCode.CantDetectError, 'No features in text.')
langdetect.lang_detect_exception.LangDetectException: No features in text.

  • Swapped sender and title:
    I noticed that the sender (an author name for example, written in the first lines of a PDF), when found, was saved as Title and that the file name was saved as Sender!

That's it.

Timestamped and searchable logs

Hi everyone!

I'm trying to track down what may be data loss on import. I am importing around 20000 PDFs which are numbered by sequence. I have noticed that some of the sequences are broken but the overall total in the database appears to be okay.

Having searchable timestamped logs would really help in identifying possible data loss.

Thanks!

Tag on whole words

Hi Daniel,

I want to tag documents by year but I have a bank account with 2002 in the number. You can guess what happens ;-)

I patched Tag.matches to match only on whole words using this:

text = re.sub(r'\W+', ' ', text.lower()).split()

I wonder if it would be useful to add an option for whole word matching.
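As a self-contained illustration of the patch above, the same normalisation can be wrapped in a small helper. The function name `matches_whole_word` is hypothetical; only the `re.sub` line comes from the patch.

```python
import re


def matches_whole_word(term, text):
    """True when `term` occurs in `text` as a complete word, ignoring case."""
    # Replace every run of non-word characters with a space, then split.
    words = re.sub(r"\W+", " ", text.lower()).split()
    return term.lower() in words
```

With this, "2002" matches "Statement for 2002." but not "Account 120020045". Note that an account number written with separators, such as 12-2002-45, would still match, because the hyphens are treated as word boundaries.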

Cheers,
Jason.

Feature request: Text context in search results

Another useful feature might be having a column in the data table that shows up when you do a search, and shows a little bit of context from the text contents on the front page (so you know which file to download). This would be useful particularly if your documents don't have particularly descriptive titles.

Which NAS would you recommend?

Thank you for this awesome project! I am asking here, because there exists no support forum etc. for paperless. I hope my issue does not interfere too much.

I am thinking about buying a NAS with RAID 1 for paperless. However, I am not really into running custom code on a NAS, nor do I have a NAS.
So I want to ask if someone has paperless running on their NAS and/or could recommend a simple NAS for paperless with two HDDs (RAID 1). :)

Thanks a lot.

A Better Front-End

The one we have right now is the one that comes free with Django. It's great for what it's supposed to do, but it was never meant as a primary front-end.

Now that there's a proper REST API, we need a real front-end in the one-page-app style.

user to login

Hi.

I tried to follow your README, but "python manage.py consume" does not exist and I don't know which user to use to log in.

Kind regards,
Giovanni

No module named 'django'

I'm trying to initialise the database, but I keep getting this error:

-> ./manage.py createsuperuser
Traceback (most recent call last):
  File "./manage.py", line 8, in <module>
    from django.conf import settings
ImportError: No module named 'django'

I installed requirements and I've changed PASSPHRASE in settings.py. Although I'm not sure what CONSUMPTION_DIR needs to be changed to

document_consumer does not ask for passphrase

I'm currently running it with no PASSPHRASE in settings.py. However, manage.py checks for the wrong command name:

if "runserver" in sys.argv or "consume" in sys.argv:

This doesn't cause any visible errors, but all stored PDFs are 0 bytes.

Quick fix: make it check for "document_consumer" instead.
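The quick fix could look roughly like the following sketch. The tuple and the helper name `needs_passphrase` are illustrative only; the command names are the ones mentioned in this report.

```python
# Management commands that need the passphrase; names taken from this report.
COMMANDS_NEEDING_PASSPHRASE = ("runserver", "document_consumer")


def needs_passphrase(argv):
    """True when any passphrase-dependent command appears in argv."""
    return any(command in argv for command in COMMANDS_NEEDING_PASSPHRASE)
```

In `manage.py` this would be called as `needs_passphrase(sys.argv)`. Keeping the names in one tuple means a renamed command only has to be updated in one place.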

Full text search

What do you guys think about implementing a full text search? It would be very helpful if one could search through all the stored bills/receipts. How about using Elasticsearch?

Had some JPGs in same directory as PDFs, but cannot view them afterwards

The JPGs seem to have been processed fine, I can see the content from OCR etc.

But when I click on the (404ed/broken) attachment icon I get an error (copy + pasted below).
Could it be because the scanner saved automatically in capital letters, JPG rather than jpg/jpeg ?

Let me know if there is anything else I can help with.

Environment:

Request Method: GET
Request URL: http://192.168.3.1:8100/fetch/19

Django Version: 1.9
Python Version: 3.4.3
Installed Applications:
['django.contrib.admin',
'django.contrib.auth',
'django.contrib.contenttypes',
'django.contrib.sessions',
'django.contrib.messages',
'django.contrib.staticfiles',
'django_extensions',
'documents',
'logger']
Installed Middleware:
['django.middleware.security.SecurityMiddleware',
'django.contrib.sessions.middleware.SessionMiddleware',
'django.middleware.common.CommonMiddleware',
'django.middleware.csrf.CsrfViewMiddleware',
'django.contrib.auth.middleware.AuthenticationMiddleware',
'django.contrib.auth.middleware.SessionAuthenticationMiddleware',
'django.contrib.messages.middleware.MessageMiddleware',
'django.middleware.clickjacking.XFrameOptionsMiddleware']

Traceback:

File "/home/seb/.local/lib/python3.4/site-packages/django/core/handlers/base.py" in get_response

  1.                 response = self.process_exception_by_middleware(e, request)
    

File "/home/seb/.local/lib/python3.4/site-packages/django/core/handlers/base.py" in get_response

  1.                 response = wrapped_callback(request, *callback_args, **callback_kwargs)
    

File "/home/seb/.local/lib/python3.4/site-packages/django/views/generic/base.py" in view

  1.         return self.dispatch(request, *args, **kwargs)
    

File "/home/seb/.local/lib/python3.4/site-packages/django/views/generic/base.py" in dispatch

  1.     return handler(request, *args, **kwargs)
    

File "/home/seb/.local/lib/python3.4/site-packages/django/views/generic/detail.py" in get

  1.     return self.render_to_response(context)
    

File "/home/seb/.paperless/src/documents/views.py" in render_to_response

  1.         content_type=content_types[self.object.file_type]
    

Exception Type: KeyError at /fetch/19
Exception Value: 'JPG'
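The `KeyError: 'JPG'` is consistent with the guess about capital letters: a dictionary keyed on lower-case file types has no entry for `'JPG'`. A defensive lookup might look like the sketch below. The mapping and the function name are assumptions for illustration; the real `content_types` dictionary in `views.py` is not shown in this report.

```python
# Hypothetical mapping; the real content_types dictionary may differ.
CONTENT_TYPES = {
    "pdf": "application/pdf",
    "png": "image/png",
    "jpg": "image/jpeg",
    "jpeg": "image/jpeg",
}


def content_type_for(file_type, default="application/octet-stream"):
    """Look up a MIME type, ignoring the case of the stored file type."""
    return CONTENT_TYPES.get(file_type.lower(), default)
```

Normalising the file type once at import time, rather than at every lookup, would be another reasonable option.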

How does one change document naming after it's consumed?

Unfortunately the multifunction device I will be using for the bulk of my documents (because of its ADF) is rather silly in its Scan to FTP options: static prefix + document counter + .pdf. I saw the documentation on retagging, which would apply here [as docs increase I'll need to add tags for organization]. It appears I can change the Sender / File name from the web UI as well, but with volume that might be rather painful, so I was hoping I missed something or that there are other suggestions. Granted, this won't be as much of an issue if I scan documents as they come in, but since I'm testing how things work I have a backlog [file cabinet full] that I'd like to digitize just to have access to them.

I'm considering a pre-consumer step to name the files better first, but I am trying to come to a solution that limits the manual work required. OCR text may be enough to identify repetitive things [utility bill etc].
I would prefer not to have to manually rename every file consumed.

Authentication on the API

The REST API is there, but it's very limited. There's no authentication, no permissions checking, etc. I'm not even sure what will happen if you issue a POST or PUT request on any of the three access points (sender, tag, and document).

I'd like to introduce oauth2 and restrict read and write requests to authenticated users.

Make sure file is preserved on import failure

In issue #48, when the import failed, the file was still deleted.

Files should never be deleted on any type of failure as there is nothing to help with diagnostics.

Perhaps even imported files should be moved to another directory for a day or two?
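A sketch of what this could look like: the file is moved to a "done" or "failed" directory instead of being deleted, so nothing is lost whatever happens. The function name, the `process` callback and both directory parameters are hypothetical, and pruning the "done" directory after a day or two is left out.

```python
import logging
import shutil
from pathlib import Path

logger = logging.getLogger(__name__)


def consume_safely(path, process, done_dir, failed_dir):
    """Run `process` on `path`, then move the file instead of deleting it.

    Returns True when processing succeeded and False when it raised.
    """
    path = Path(path)
    try:
        process(path)
        succeeded = True
    except Exception:  # broad on purpose: no kind of failure may lose the file
        logger.exception("Failed to consume %s", path)
        succeeded = False

    target_dir = Path(done_dir if succeeded else failed_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(path), str(target_dir / path.name))
    return succeeded
```

A file with the same name already sitting in the target directory may be overwritten by the move, so a real implementation would probably want to add a timestamp or counter to the target name.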

Thanks!
Jason.

AttributeError: __exit__ when using document_consumer

Very cool tool! I ran into the following bug when I ran (using Ubuntu 14):

$ ./manage.py document_consumer
Consuming /home/phi/Dropbox/Scans/New Doc 1_1.pdf
Traceback (most recent call last):
  File "./manage.py", line 18, in <module>
    execute_from_command_line(sys.argv)
  File "/usr/local/lib/python3.4/dist-packages/django/core/management/__init__.py", line 350, in execute_from_command_line
    utility.execute()
  File "/usr/local/lib/python3.4/dist-packages/django/core/management/__init__.py", line 342, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/usr/local/lib/python3.4/dist-packages/django/core/management/base.py", line 348, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/usr/local/lib/python3.4/dist-packages/django/core/management/base.py", line 399, in execute
    output = self.handle(*args, **options)
  File "/home/phi/Downloads/paperless/src/documents/management/commands/document_consumer.py", line 68, in handle
    self.loop()
  File "/home/phi/Downloads/paperless/src/documents/management/commands/document_consumer.py", line 98, in loop
    text = self._get_ocr(pngs)
  File "/home/phi/Downloads/paperless/src/documents/management/commands/document_consumer.py", line 159, in _get_ocr
    raw_text = self._ocr(pngs, self.DEFAULT_OCR_LANGUAGE)
  File "/home/phi/Downloads/paperless/src/documents/management/commands/document_consumer.py", line 201, in _ocr
    with Image.open(os.path.join(self.SCRATCH, png)) as f:
AttributeError: __exit__

What fixed the issue for me was replacing the with Image.open statement on these lines with:

            f = Image.open(os.path.join(self.SCRATCH, png))
            self._render("    {}".format(f.filename), 3)
            r += self.OCR.image_to_string(f, lang=lang)

I didn't want to submit a pull request since it seems like no one else had this issue. Otherwise everything is great, thanks!!

convert: no images defined

On Mac, OS X Yosemite 10.10.3

Not sure if relevant, I installed imagemagick via brew and so likewise changed the path to the corresponding CONVERT_BINARY in settings.py to "/usr/local/bin/convert"... even after clearing cache, this is an example of the output from the document_consumer script:

Consuming /Users/miles/Admin/Scanned/IMG_0001.pdf
convert: no images defined `/tmp/paperless/2623688.png' @ error/convert.c/ConvertImageCommand/3241.
Traceback (most recent call last):
  File "./manage.py", line 18, in <module>
    execute_from_command_line(sys.argv)
  File "/Users/miles/Library/Python/3.5/lib/python/site-packages/django/core/management/__init__.py", line 350, in execute_from_command_line
    utility.execute()
  File "/Users/miles/Library/Python/3.5/lib/python/site-packages/django/core/management/__init__.py", line 342, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/Users/miles/Library/Python/3.5/lib/python/site-packages/django/core/management/base.py", line 348, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/Users/miles/Library/Python/3.5/lib/python/site-packages/django/core/management/base.py", line 399, in execute
    output = self.handle(*args, **options)
  File "/Users/miles/Admin/paperless/src/documents/management/commands/document_consumer.py", line 68, in handle
    self.loop()
  File "/Users/miles/Admin/paperless/src/documents/management/commands/document_consumer.py", line 98, in loop
    text = self._get_ocr(pngs)
  File "/Users/miles/Admin/paperless/src/documents/management/commands/document_consumer.py", line 161, in _get_ocr
    guessed_language = langdetect.detect(raw_text)
  File "/Users/miles/Library/Python/3.5/lib/python/site-packages/langdetect/detector_factory.py", line 130, in detect
    return detector.detect()
  File "/Users/miles/Library/Python/3.5/lib/python/site-packages/langdetect/detector.py", line 135, in detect
    probabilities = self.get_probabilities()
  File "/Users/miles/Library/Python/3.5/lib/python/site-packages/langdetect/detector.py", line 142, in get_probabilities
    self._detect_block()
  File "/Users/miles/Library/Python/3.5/lib/python/site-packages/langdetect/detector.py", line 149, in _detect_block
    raise LangDetectException(ErrorCode.CantDetectError, 'No features in text.')
langdetect.lang_detect_exception.LangDetectException: No features in text.

A cursory glance at the /tmp/paperless/ directory also confirms that there are no files there. Does that directory need particular permissions for the PNGs to be created?
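
One way to narrow this down is to check both halves of the conversion step before running the consumer: that the configured convert binary can be executed, and that the scratch directory exists and is writable. The sketch below is only an illustration of that check. The function name `conversion_setup_problems` is hypothetical and not part of Paperless; the two paths in the usage note are the ones quoted in this report.

```python
import os
import shutil


def conversion_setup_problems(convert_binary, scratch_dir):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper for illustration, not Paperless code.
    """
    problems = []

    # shutil.which accepts a bare command name or a full path and returns
    # None when nothing executable is found.
    if shutil.which(convert_binary) is None:
        problems.append(
            "convert binary not found or not executable: " + convert_binary
        )

    # Creating files in a directory needs write and search (execute)
    # permission on that directory.
    if not os.path.isdir(scratch_dir):
        problems.append("scratch directory does not exist: " + scratch_dir)
    elif not os.access(scratch_dir, os.W_OK | os.X_OK):
        problems.append("scratch directory is not writable: " + scratch_dir)

    return problems
```

Calling `conversion_setup_problems("/usr/local/bin/convert", "/tmp/paperless")` as the same user that runs the consumer would list anything obviously wrong. An empty list does not prove that convert can read the PDF itself, only that the binary and the directory are usable.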

Make this a proper django package

Hello, thanks for the nice project!

Would you consider making a proper django package out of paperless for easier integration with other django projects? You'll just need to create a django-paperless (or something like that) package that contains (more or less) the documents app of this project and add it as a dependency.

Thanks!

Screenshot/demo

I want to try this soon after looking at the README, but a screenshot or a live demo would really help to get an impression of how it looks 😉

Have a way to re-apply tag's search terms on already imported documents

Only after importing 100+ documents did I notice you could set up tags with search words to automatically tag newly processed documents.

When you come round to doing the UI, it might be good to include information on this.

Alternatively the ability to re-run all the tagging rules on existing documents would be very useful.
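
A rough sketch of what such a re-run could look like is below. It is not based on the actual Paperless models: `Rule`, `Doc`, and `retag` are hypothetical stand-ins, and the matching is assumed to be "any of the search words appears as a whole word", which may differ from how the consumer matches tags.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Rule:
    tag: str    # tag to apply
    match: str  # whitespace-separated search words


@dataclass
class Doc:
    content: str
    tags: set = field(default_factory=set)


def retag(documents, rules):
    """Apply every rule to every document; return how many tags were added."""
    added = 0
    for doc in documents:
        # Compare against whole words, case-insensitively.
        words_in_doc = set(re.findall(r"\w+", doc.content.lower()))
        for rule in rules:
            wanted = rule.match.lower().split()
            hit = any(word in words_in_doc for word in wanted)
            if hit and rule.tag not in doc.tags:
                doc.tags.add(rule.tag)
                added += 1
    return added
```

Because tags are only ever added, running this twice is harmless: the second pass adds nothing. Whether a re-run should also remove tags that no longer match is a separate design question.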

Consumer crashes when reading files

Documents were scanned with Simple Scan in Arch Linux with a Bare Metal "install". The webserver is running fine. I edited the settings.py as per the setup instructions. OCR_THREADS is set to "2" (I'm running an i7).

I don't know much about Python, so please tell me what other information I can provide.

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/mariachi/bin/paperless/src/documents/consumer.py", line 237, in image_to_string
    ocr = pyocr.get_available_tools()[0]
IndexError: list index out of range
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "./manage.py", line 18, in <module>
    execute_from_command_line(sys.argv)
  File "/home/mariachi/.local/lib/python3.5/site-packages/django/core/management/__init__.py", line 350, in execute_from_command_line
    utility.execute()
  File "/home/mariachi/.local/lib/python3.5/site-packages/django/core/management/__init__.py", line 342, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/mariachi/.local/lib/python3.5/site-packages/django/core/management/base.py", line 348, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/home/mariachi/.local/lib/python3.5/site-packages/django/core/management/base.py", line 399, in execute
    output = self.handle(*args, **options)
  File "/home/mariachi/bin/paperless/src/documents/management/commands/document_consumer.py", line 49, in handle
    self.loop()
  File "/home/mariachi/bin/paperless/src/documents/management/commands/document_consumer.py", line 59, in loop
    self.file_consumer.consume()
  File "/home/mariachi/bin/paperless/src/documents/consumer.py", line 113, in consume
    text = self._get_ocr(pngs)
  File "/home/mariachi/bin/paperless/src/documents/consumer.py", line 173, in _get_ocr
    raw_text = self._ocr([pngs[middle]], self.DEFAULT_OCR_LANGUAGE)
  File "/home/mariachi/bin/paperless/src/documents/consumer.py", line 229, in _ocr
    self.image_to_string, itertools.product(pngs, [lang]))
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 260, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 608, in get
    raise self._value
IndexError: list index out of range
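
The traceback shows `pyocr.get_available_tools()` returning an empty list, which suggests pyocr did not find an OCR engine such as Tesseract on this machine. A guard along the following lines would turn the bare IndexError into a readable message. This is a sketch only; `first_available_tool` is a hypothetical helper, not something that exists in Paperless or pyocr.

```python
def first_available_tool(tools):
    """Return the first OCR tool, or raise a descriptive error if none exist.

    `tools` is expected to be the list returned by
    pyocr.get_available_tools(); it is passed in as a parameter so this
    sketch has no third-party imports.
    """
    if not tools:
        raise RuntimeError(
            "No OCR tool available. Check that an OCR engine such as "
            "Tesseract is installed and on PATH for this user."
        )
    return tools[0]
```

In consumer.py the failing line would then read something like `ocr = first_available_tool(pyocr.get_available_tools())`.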

Thumbnails

With all of the interest in making a proper UI, I figure I should set up some means of providing a thumbnail for each document.

Gitter chat

This looks like a great project. I'm working on getting it set up on one of my home servers, but have run into a couple of questions. Is there (or should there be) a Gitter chat for the project, or maybe an IRC channel? That could help get new users into the swing of things a little easier.

Continuous Integration for Docker & Vagrant builds

Given both the rapid growth of this project and general convenience, I think a Continuous Integration solution is in order. It would help everyone by:

  1. showing the end-user in the README if the current master builds (which it should!),
  2. checking every PR if it builds correctly, and informing the requester if it didn't,
  3. not allowing a merge of a failed PR (configurable).

Further, this should take at least some burden off of @danielquinn's shoulders, since a baseline check will already have been performed on the submitted changes.

From what I've seen, Travis CI seems to be the de facto standard on GitHub for automatically building open-source projects for free. There are some alternatives (like drone.io) but they usually don't allow us to install our requirements (like ImageMagick) or have other limitations. While a system like Jenkins could definitely cover all use cases, and I use it personally with great success, it requires a dedicated system to run on.

Travis CI is controlled by a .travis.yml file which has to be located in the root of the repository. This file is a specification for Travis on what steps to perform and what commands to execute in order to build and test an application. The build-environment can be a full VM based on Ubuntu 12.04 or Ubuntu 14.04 or it can be container-based.

I have created a simple .travis.yml to test if and how it works, see feature/travis-ci. I have managed to build Paperless successfully. This build was based on my Docker implementation feature/dockerfile and tested both the building of the Docker image and running of the tests under both Python 3.4 and 3.5.

Building and testing the application directly in the Travis VM without using Docker is of course possible, and Travis allows testing against multiple Python versions automatically. But (a) from what I've seen, combining the Travis Python tests with the Docker tests is a hassle, and (b) the environment we build up within the Docker container is identical to the one we would build up in the VM.

Ignoring the Dockerfile altogether is of course an option. While this would simplify installing and testing directly in the Travis VM, it would not guarantee that a pull request doesn't break the Dockerfile, which in my eyes should be covered by CI! (Even testing Vagrant seems to be an option.)

In summary, these seem to be the points for discussion:

  • Should a CI system be introduced?
  • What exactly are the requirements for the CI?
  • Is Travis CI a decent choice? Are there better alternatives?
  • Do we want to cover the Dockerfile and Vagrantfile?

Hybridise PDFs with combined OCR'd text

OCRFeeder is an application built around various recognition backends. It does the page segmentation (finding sections containing text), feeds them to the OCR engine of choice, and provides export options such as combined PDF (a PDF containing the plaintext below the original document), plaintext, ODT and others.

It's written in Python, so it should be possible to integrate the segmentation and export code with Paperless.

Automatic tagging

Now that we have tags, it'd be nice to have the ability for a user to say something like:

If a document is indexed containing arbitrary words, tag it with tag.

I think it'd be nice to support regular expressions and/or allow for simple logic like "all of these words" vs. "some of these words".
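
To make the three options concrete, here is a minimal sketch of a matcher that supports "any of these words", "all of these words", and a regular expression. The function name `matches` and the mode names are assumptions made for illustration; nothing here reflects an existing Paperless API.

```python
import re


def matches(text, pattern, mode="any"):
    """Return True if `text` satisfies `pattern` under the given mode.

    mode="any":   at least one whitespace-separated word of `pattern` occurs
    mode="all":   every word of `pattern` occurs
    mode="regex": `pattern` is a regular expression searched for in `text`
    Word matching is case-insensitive and on whole words only.
    """
    if mode == "regex":
        return re.search(pattern, text, re.IGNORECASE) is not None

    wanted = pattern.lower().split()
    if not wanted:
        return False  # an empty rule never matches

    found = set(re.findall(r"\w+", text.lower()))
    if mode == "all":
        return all(word in found for word in wanted)
    if mode == "any":
        return any(word in found for word in wanted)
    raise ValueError("unknown mode: " + mode)
```

For example, `matches("Invoice from the bank", "invoice bank", "all")` is true, while the same rule against "Invoice from the shop" only matches in "any" mode.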
