the-paperless-project / paperless

Scan, index, and archive all of your paper documents

License: GNU General Public License v3.0

Python 27.38% Shell 0.56% CSS 14.11% HTML 4.64% JavaScript 17.61% Dockerfile 0.25% PostScript 15.94% Less 6.50% SCSS 13.00%

documents paper ocr archiving search

paperless's Issues

Feature request: Status indication in web gui

It would be great if there were some sort of indication in the GUI as to what was happening at any given time, e.g. "Idle", "Consuming name of file X of N pages completed", etc.

Unfortunately I have no experience with django, so this is really just a request, as I don't think I'd be able to implement something like this myself.

Authentication on the API

The REST API is there, but it's very limited. There's no authentication, no permissions checking, etc. I'm not even sure what will happen if you issue a POST or PUT request on any of the three access points (sender, tag, and document).

I'd like to introduce oauth2 and restrict read and write requests to authenticated users.

FileNotFoundError: [Errno 2] No such file or directory: '/usr/bin/convert'

Hi, when I run ./manage.py document_consumer this comes out (OS X El Capitan):

Traceback (most recent call last):
  File "./manage.py", line 18, in <module>
    execute_from_command_line(sys.argv)
  File "/Users/Francesco/Library/Python/3.5/lib/python/site-packages/django/core/management/__init__.py", line 350, in execute_from_command_line
    utility.execute()
  File "/Users/Francesco/Library/Python/3.5/lib/python/site-packages/django/core/management/__init__.py", line 342, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/Users/Francesco/Library/Python/3.5/lib/python/site-packages/django/core/management/base.py", line 348, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/Users/Francesco/Library/Python/3.5/lib/python/site-packages/django/core/management/base.py", line 399, in execute
    output = self.handle(*args, **options)
  File "/Users/Francesco/Desktop/paperless-master/src/documents/management/commands/document_consumer.py", line 49, in handle
    self.loop()
  File "/Users/Francesco/Desktop/paperless-master/src/documents/management/commands/document_consumer.py", line 59, in loop
    self.file_consumer.consume()
  File "/Users/Francesco/Desktop/paperless-master/src/documents/consumer.py", line 116, in consume
    pngs = self._get_greyscale(tempdir, doc)
  File "/Users/Francesco/Desktop/paperless-master/src/documents/consumer.py", line 139, in _get_greyscale
    "-type", "grayscale", doc, png
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/subprocess.py", line 950, in __init__
    restore_signals, start_new_session)
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/subprocess.py", line 1544, in _execute_child
    raise child_exception_type(errno_num, err_msg)
FileNotFoundError: [Errno 2] No such file or directory: '/usr/bin/convert'

I've installed pip3 requirements from requirements.txt.

Error with dockerfile

I tried to pull the code and set it up with Docker, but I get an error. How can I fix it?

Introduce a means of detecting & changing OCR language

PR #7 starts paperless down the path of supporting multiple languages, but I'd like to get it to the point where we can scan documents of different languages and have it automatically choose the appropriate language parser for OCR. I have no idea how to do this of course, so ideas/suggestions are welcome.

Tag on whole words

Hi Daniel,

I want to tag documents by year but I have a bank account with 2002 in the number. You can guess what happens ;-)

I patched Tag.matches to match only on whole words using this:

text = re.sub(r'\W+', ' ', text.lower()).split()

I wonder if it would be useful to add an option for whole word matching.

Cheers,
Jason.
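
A minimal sketch of the whole-word idea described above, built around the same `re.sub` call. The function name `matches_whole_word` is hypothetical and is not the project's `Tag.matches`; it only illustrates how splitting on non-word characters keeps "2002" from matching inside a longer account number.

```python
import re


def matches_whole_word(term, text):
    """Return True if `term` occurs in `text` as a complete word.

    Hypothetical helper: both sides are lower-cased, and runs of
    non-word characters are treated as word separators.
    """
    words = re.sub(r"\W+", " ", text.lower()).split()
    return term.lower() in words
```

With this approach a tag for "2002" would match "Statement for 2002" but not "Account 1200234", at the cost of no longer supporting substring matches.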

Not consuming on Vagrant

I'm testing this locally on Linux, and can't seem to get consumption to work. Django runs fine, consumer lists no errors, I have a PDF in the consumption dir.

Nothing.

Any gotchas? File size, type? Anything else?

Have a way to re-apply tag's search terms on already imported documents

Only after importing 100+ documents did I notice you could set up tags with search words to automatically tag newly processed documents.

When you come round to doing the UI, it might be good to include information on this.

Alternatively the ability to re-run all the tagging rules on existing documents would be very useful.

Label/Tag

I think it would be a good idea to add some kind of label or tags.
That way one could tag the electric bill with, for example: BILL, ELECTRIC
Or if it is for your mother's computer: BILL, COMPUTER, MOM, ...
Some bills might be too long for a single scan and that way it is possible to group those a little at least.

With a little machine learning that could be done automatically I suppose.

Logging in paperless?

Probably related to #16.

It would be great to implement some sort of logging for the webserver and the consumer, since right now everything seems to be output to stdout. Perhaps it's just me, but I'm not sure how to capture this for later analysis. I think having some sort of log would also help with a status indicator.

Make this a proper django package

Hello, thanks for the nice project!

Would you consider making a proper django package out of paperless for easier integration with other django projects? You'll just need to create a django-paperless (or something like that) package that contains (more or less) the documents app of this project and add it as a dependency.

Thanks!

A Better Front-End

The one we have right now is the one that comes free with Django. It's great for what it's supposed to do, but it was never meant as a primary front-end.

Now that there's a proper REST API, we need a real front-end in the one-page-app style.

Had some JPGs in same directory as PDFs, but cannot view them afterwards

The JPGs seem to have been processed fine, I can see the content from OCR etc.

But when I click on the (404ed/broken) attachment icon I get an error (copy + pasted below).
Could it be because the scanner saved automatically in capital letters, JPG rather than jpg/jpeg ?

Let me know if there is anything else I can help with.

Environment:

Request Method: GET
Request URL: http://192.168.3.1:8100/fetch/19

Django Version: 1.9
Python Version: 3.4.3
Installed Applications:
['django.contrib.admin',
'django.contrib.auth',
'django.contrib.contenttypes',
'django.contrib.sessions',
'django.contrib.messages',
'django.contrib.staticfiles',
'django_extensions',
'documents',
'logger']
Installed Middleware:
['django.middleware.security.SecurityMiddleware',
'django.contrib.sessions.middleware.SessionMiddleware',
'django.middleware.common.CommonMiddleware',
'django.middleware.csrf.CsrfViewMiddleware',
'django.contrib.auth.middleware.AuthenticationMiddleware',
'django.contrib.auth.middleware.SessionAuthenticationMiddleware',
'django.contrib.messages.middleware.MessageMiddleware',
'django.middleware.clickjacking.XFrameOptionsMiddleware']

Traceback:

File "/home/seb/.local/lib/python3.4/site-packages/django/core/handlers/base.py" in get_response

                response = self.process_exception_by_middleware(e, request)

File "/home/seb/.local/lib/python3.4/site-packages/django/core/handlers/base.py" in get_response

                response = wrapped_callback(request, *callback_args, **callback_kwargs)

File "/home/seb/.local/lib/python3.4/site-packages/django/views/generic/base.py" in view

        return self.dispatch(request, *args, **kwargs)

File "/home/seb/.local/lib/python3.4/site-packages/django/views/generic/base.py" in dispatch

    return handler(request, *args, **kwargs)

File "/home/seb/.local/lib/python3.4/site-packages/django/views/generic/detail.py" in get

    return self.render_to_response(context)

File "/home/seb/.paperless/src/documents/views.py" in render_to_response

        content_type=content_types[self.object.file_type]

Exception Type: KeyError at /fetch/19
Exception Value: 'JPG'
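
The `KeyError: 'JPG'` above suggests the lookup is case-sensitive. A minimal sketch of a tolerant lookup follows; the `CONTENT_TYPES` table and the `content_type_for` helper are assumptions for illustration, not the actual contents of `documents/views.py`.

```python
# Hypothetical mapping; the project's real table may differ.
CONTENT_TYPES = {
    "pdf": "application/pdf",
    "png": "image/png",
    "jpg": "image/jpeg",
    "jpeg": "image/jpeg",
    "gif": "image/gif",
}


def content_type_for(file_type, default="application/octet-stream"):
    """Look up a MIME type, ignoring the case of the stored extension."""
    return CONTENT_TYPES.get(file_type.lower(), default)
```

Normalising the extension once at import time would be an alternative, so that every later lookup sees lower-case values.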

requirements.txt

Please list the requirements to get this project running

Harmonise environment variable names with constant names

As per this discussion, we need to bring a sense of order to the names of environment variables vs. constant names. At the same time, I'd like to set CONVERT_BINARY = os.environ.get("PAPERLESS_CONVERT_BINARY", "convert") rather than the default of /usr/bin/convert, which is distro-specific.
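
A small sketch of the environment-driven default proposed above. `os.environ` is a mapping rather than a function, so the lookup goes through `.get`. The helper name `get_convert_binary` and its `environ` parameter are hypothetical and exist only to make the behaviour easy to test.

```python
import os


def get_convert_binary(environ=os.environ):
    """Return the ImageMagick binary to call.

    Falls back to the bare name "convert", which is resolved through
    PATH instead of assuming /usr/bin/convert.
    """
    return environ.get("PAPERLESS_CONVERT_BINARY", "convert")
```

In a settings module this would typically collapse to a single assignment, e.g. `CONVERT_BINARY = get_convert_binary()`.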

Image upload API ?

Should we have an API to upload an image of a bill/receipt? Not everyone has a document scanner, so I think being able to upload images would be really handy.

Server setup via Docker

It would be really cool to have a small Docker script that allows this project to be built and deployed behind an Apache or nginx server; that way a person could simply drop it onto a cheap host somewhere in the cloud and have the service accessible to them from everywhere...

AttributeError: exit when using document_consumer

Very cool tool! I ran into the following bug when I ran (using Ubuntu 14):

$ ./manage.py document_consumer
Consuming /home/phi/Dropbox/Scans/New Doc 1_1.pdf
Traceback (most recent call last):
  File "./manage.py", line 18, in <module>
    execute_from_command_line(sys.argv)
  File "/usr/local/lib/python3.4/dist-packages/django/core/management/__init__.py", line 350, in execute_from_command_line
    utility.execute()
  File "/usr/local/lib/python3.4/dist-packages/django/core/management/__init__.py", line 342, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/usr/local/lib/python3.4/dist-packages/django/core/management/base.py", line 348, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/usr/local/lib/python3.4/dist-packages/django/core/management/base.py", line 399, in execute
    output = self.handle(*args, **options)
  File "/home/phi/Downloads/paperless/src/documents/management/commands/document_consumer.py", line 68, in handle
    self.loop()
  File "/home/phi/Downloads/paperless/src/documents/management/commands/document_consumer.py", line 98, in loop
    text = self._get_ocr(pngs)
  File "/home/phi/Downloads/paperless/src/documents/management/commands/document_consumer.py", line 159, in _get_ocr
    raw_text = self._ocr(pngs, self.DEFAULT_OCR_LANGUAGE)
  File "/home/phi/Downloads/paperless/src/documents/management/commands/document_consumer.py", line 201, in _ocr
    with Image.open(os.path.join(self.SCRATCH, png)) as f:
AttributeError: __exit__

What fixed the issue for me was replacing the with Image.open statement on these lines with:

            f = Image.open(os.path.join(self.SCRATCH, png))
            self._render("    {}".format(f.filename), 3)
            r += self.OCR.image_to_string(f, lang=lang)

I didn't want to submit a pull request since it seems like no one else had this issue. Otherwise everything is great, thanks!!
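
`AttributeError: __exit__` means the object returned by `Image.open` did not support the `with` statement, presumably because the installed imaging library predates context-manager support (an assumption; the report does not state the version). A hedged alternative to dropping the `with` block entirely is a tiny wrapper that works either way. The name `closing_if_possible` is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def closing_if_possible(resource):
    """Yield `resource` and call its close() afterwards, if it has one.

    Unlike a bare `with resource:`, this does not require the object
    to implement __enter__/__exit__ itself.
    """
    try:
        yield resource
    finally:
        close = getattr(resource, "close", None)
        if callable(close):
            close()
```

Usage would look like `with closing_if_possible(Image.open(path)) as f:`, leaving the rest of the loop body unchanged.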

Consumer crashes when reading files

Documents were scanned with Simple Scan in Arch Linux with a Bare Metal "install". The webserver is running fine. I edited the settings.py as per the setup instructions. OCR_THREADS is set to "2" (I'm running an i7).

I don't know much about Python, so please tell me what other information I can provide.

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/mariachi/bin/paperless/src/documents/consumer.py", line 237, in image_to_string
    ocr = pyocr.get_available_tools()[0]
IndexError: list index out of range
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "./manage.py", line 18, in <module>
    execute_from_command_line(sys.argv)
  File "/home/mariachi/.local/lib/python3.5/site-packages/django/core/management/__init__.py", line 350, in execute_from_command_line
    utility.execute()
  File "/home/mariachi/.local/lib/python3.5/site-packages/django/core/management/__init__.py", line 342, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/mariachi/.local/lib/python3.5/site-packages/django/core/management/base.py", line 348, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/home/mariachi/.local/lib/python3.5/site-packages/django/core/management/base.py", line 399, in execute
    output = self.handle(*args, **options)
  File "/home/mariachi/bin/paperless/src/documents/management/commands/document_consumer.py", line 49, in handle
    self.loop()
  File "/home/mariachi/bin/paperless/src/documents/management/commands/document_consumer.py", line 59, in loop
    self.file_consumer.consume()
  File "/home/mariachi/bin/paperless/src/documents/consumer.py", line 113, in consume
    text = self._get_ocr(pngs)
  File "/home/mariachi/bin/paperless/src/documents/consumer.py", line 173, in _get_ocr
    raw_text = self._ocr([pngs[middle]], self.DEFAULT_OCR_LANGUAGE)
  File "/home/mariachi/bin/paperless/src/documents/consumer.py", line 229, in _ocr
    self.image_to_string, itertools.product(pngs, [lang]))
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 260, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 608, in get
    raise self._value
IndexError: list index out of range
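
The traceback above shows `pyocr.get_available_tools()[0]` failing on an empty list, which usually points to no OCR backend (such as Tesseract) being found on the system. A minimal sketch of a guard that turns this into a readable error; `pick_ocr_tool` is a hypothetical helper that takes the list as an argument so it can be exercised without pyocr installed.

```python
def pick_ocr_tool(available_tools):
    """Return the first available OCR tool or fail with a clear message."""
    if not available_tools:
        raise RuntimeError(
            "No OCR tool found. Is an OCR backend such as tesseract "
            "installed and on PATH?"
        )
    return available_tools[0]
```

In the consumer this would be called as `ocr = pick_ocr_tool(pyocr.get_available_tools())`.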

No module named 'django'

I'm trying to initialise the database, but I keep getting this error:

-> ./manage.py createsuperuser
Traceback (most recent call last):
  File "./manage.py", line 8, in <module>
    from django.conf import settings
ImportError: No module named 'django'

I installed the requirements and I've changed PASSPHRASE in settings.py, although I'm not sure what CONSUMPTION_DIR needs to be changed to.

Timestamped and searchable logs

Hi everyone!

I'm trying to track down what may be data loss on import. I am importing around 20000 PDFs which are numbered by sequence. I have noticed that some of the sequences are broken but the overall total in the database appears to be okay.

Having searchable timestamped logs would really help in identifying possible data loss.

Thanks!

Full text search

What do you guys think about implementing a full-text search? It would be very helpful to be able to search through all the stored bills/receipts. How about using Elasticsearch?

Thumbnails

With all of the interest in making a proper UI, I figure I should set up some means of providing a thumbnail for each document.

How does one change document naming after it's consumed?

Unfortunately the multifunction device I will be using for the bulk of my documents (because of its ADF) is rather silly in its Scan to FTP options: static prefix + document counter + .pdf. I saw the documentation on retagging, which would apply here (as the number of docs increases I'll need to add tags for organization). It appears I can change the Sender / File name from the web UI as well, but with this volume that might be rather painful, and I was hoping I missed something or that there are other suggestions. Granted, this won't be as much of an issue if I scan documents as they come in, but since I'm testing how things work I have a backlog (a file cabinet full) that I'd like to digitize just to have access to them.

I'm considering a pre-consumer step to name the files better first, but I am trying to come to a solution that limits the manual work required. OCR text may be enough to identify repetitive things (utility bills etc.).
I would prefer not to have to manually rename every file consumed.

Data loss when exporting documents with duplicate titles

The export script writes documents into a directory without checking for name collisions. This means that if I upload two documents with the same title, they will produce distinct entries in the database, but the export will clobber one of them.

I'm happy to tackle fixing this (I have quite some more experience with python and django than with docker :-) ), but wanted to get your pre-approval for the approach I would take. I thought to add a timestamp prefix to the export, so that files come out looking like 2016-02-13T23:15:02Z - Title - tag,tag.pdf. It's not 100% watertight, but it seems pretty much good enough to me. It also implies the consumer has to recognise this format -- that has one nice side effect, namely that exporting docs and re-importing them will also preserve the timestamps at which they were originally imported.

If you prefer a different approach, I'm happy to take it on if you can give me a sketch, or to brainstorm alternative ideas here. I actually need something like this, because one of my use cases for paperless is storing documents that come in at regular intervals and will always get the same name (e.g. the electricity bill).

Are filenames with .isoformat() datetimes scp-compatible? (There is some concern about the : character.)
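
A sketch of the timestamp-prefix idea that also sidesteps the `:` concern by using the ISO 8601 basic format, which has no colons. The function name `export_filename` and its parameters are hypothetical; the proposal above uses the extended form `2016-02-13T23:15:02Z`, so this is a variation on it, not the author's exact format.

```python
from datetime import datetime, timezone


def export_filename(created, title, tags, extension="pdf"):
    """Build "<timestamp> - <title> - <tag,tag>.<ext>" without colons.

    `created` is expected to be a timezone-aware datetime; it is
    converted to UTC before formatting.
    """
    stamp = created.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    parts = [stamp, title]
    if tags:
        parts.append(",".join(sorted(tags)))
    return "{}.{}".format(" - ".join(parts), extension)
```

Two documents with the same title still collide if they were created within the same second, which matches the "not 100% watertight" caveat above.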

auto rotate landscape documents

Hi,
Is it possible to detect document orientation before OCR? I tried a landscape document and got no valid text out of it. I suppose manually rotating the file before consumption would fix it, but that is a pain for an automated scan-to-OCR workflow.

https://tesseract-ocr.github.io/a01281.html#ga16c4b28cadc2160bf18b84c3f897a2d2

user to login

Hi.

I tried to follow your README, but "python manage.py consume" does not exist and I don't know which user to log in with.

Kind regards,
Giovanni

How about Celery to do the hard work?

Hey,

I have used Celery workers for long-running, heavy tasks in some projects. Do you guys think it would fit well in this project?

http://www.celeryproject.org/

Gitter chat

This looks like a great project. I'm working on getting it set up on one of my home servers, but have run into a couple of questions. Is there (or should there be) a Gitter chat for the project, or maybe an IRC channel? That could help get new users into the swing of things a little more easily.

document_consumer does not ask for passphrase

I'm currently running it with no PASSPHRASE in settings.py. However, manage.py checks for the wrong command name:

if "runserver" in sys.argv or "consume" in sys.argv:

This doesn't cause any visible errors, but all stored PDFs are 0 bytes.

Quick fix: make it check for "document_consumer" instead.
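
A minimal sketch of the quick fix, assuming the check only needs to know whether one of the passphrase-dependent commands is on the command line. The names `PASSPHRASE_COMMANDS` and `needs_passphrase` are hypothetical and do not appear in `manage.py`.

```python
# Commands that read or write encrypted documents (assumed list).
PASSPHRASE_COMMANDS = {"runserver", "document_consumer"}


def needs_passphrase(argv):
    """Return True if any argument names a command that needs the passphrase."""
    return any(arg in PASSPHRASE_COMMANDS for arg in argv)
```

Keeping the command names in one set makes it harder for the check to drift out of sync when a management command is renamed.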

Integrate dotenv

I discovered dotenv today and it's doing exactly what I was trying to figure out how to do elegantly on my own. No sense in reinventing the wheel.

Continuous Integration for Docker & Vagrant builds

Given both the rapid growth of this project and general convenience, I think a Continuous Integration solution is in order. It would help everyone by:

showing the end-user in the README if the current master builds (which it should!),
checking every PR if it builds correctly, and informing the requester if it didn't,
not allowing a merge of a failed PR (configurable).

Further, this should take at least some burden off of @danielquinn's shoulders since a baseline check will already have been performed on the submitted changes.

From what I've seen, Travis CI seems to be the de facto standard on GitHub for automatically building open-source projects for free. There are some alternatives (like drone.io) but they usually don't allow us to install our requirements (like ImageMagick) or have other limitations. While a system like Jenkins could definitely cover all use cases, and I use it personally with great success, it requires a dedicated system to run on.

Travis CI is controlled by a .travis.yml file which has to be located in the root of the repository. This file is a specification for Travis on what steps to perform and what commands to execute in order to build and test an application. The build-environment can be a full VM based on Ubuntu 12.04 or Ubuntu 14.04 or it can be container-based.

I have created a simple .travis.yml to test if and how it works, see feature/travis-ci. I have managed to build Paperless successfully. This build was based on my Docker implementation feature/dockerfile and tested both the building of the Docker image and running of the tests under both Python 3.4 and 3.5.

Building and testing the application directly in the Travis VM without using Docker is of course possible, and Travis allows testing multiple Python versions automatically. But (a) from what I've seen, combining the Travis Python tests with the Docker tests is a hassle and (b) the environment we build up within the Docker container is identical to the one we would build up in the VM.

Ignoring the Dockerfile altogether is of course an option. While this would simplify installing and testing directly in the Travis VM, it would not guarantee that a pull request doesn't break the Dockerfile, which in my eyes should be covered by CI! (Even testing Vagrant seems to be an option.)

In summary, these seem to be the points for discussion:

Should a CI system be introduced?
What exactly are the requirements for the CI?
Is Travis CI a decent choice? Are there better alternatives?
Do we want to cover the Dockerfile and Vagrantfile?

Would ZeroDB work for storage?

if someone has ideas on how to do that on encrypted data

That's a claim that zerodb makes. Might be something to take a look at :)

Automatic tagging

Now that we have tags, it'd be nice to have the ability for a user to say something like:

If a document is indexed containing arbitrary words, tag it with tag.

I think it'd be nice to support regular expressions and/or allow for simple logic like "all of these words" vs. "some of these words".
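
One way the "all of these words" vs. "some of these words" logic could look, alongside a regular-expression variant. The function names and the `mode` parameter are assumptions for illustration and not the project's tagging API.

```python
import re


def tag_matches(text, words, mode="any"):
    """Match a tag's words against document text, ignoring case.

    mode="all" requires every word to be present; mode="any" requires
    at least one.
    """
    found = set(re.findall(r"\w+", text.lower()))
    wanted = {word.lower() for word in words}
    if mode == "all":
        return wanted <= found
    if mode == "any":
        return bool(wanted & found)
    raise ValueError("mode must be 'all' or 'any'")


def regex_matches(text, pattern):
    """Match a tag defined as a regular expression, ignoring case."""
    return re.search(pattern, text, re.IGNORECASE) is not None
```

Because the text is split into whole words first, this sketch also avoids matching a word inside a longer token.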

Which NAS would you recommend?

Thank you for this awesome project! I am asking here, because there exists no support forum etc. for paperless. I hope my issue does not interfere too much.

I am thinking about buying a NAS with RAID 1 for paperless. However, I am not really into running custom code on a NAS, nor do I have a NAS.
So I want to ask whether someone has paperless running on their NAS and/or could recommend a simple NAS for paperless with two HDDs (RAID 1). :)

Thanks a lot.

Make sure file is preserved on import failure

In issue #48, when the import failed, the file was still deleted.

Files should never be deleted on any type of failure as there is nothing to help with diagnostics.

Perhaps even imported files should be moved to another directory for a day or two?

Thanks!
Jason.
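
A sketch of the "never delete on failure" behaviour requested above. It assumes the consumer removes the source file after a successful import; `consume_safely`, the `consume` callable and `failed_dir` are hypothetical names, not part of the project.

```python
import os
import shutil


def consume_safely(path, consume, failed_dir):
    """Run `consume(path)` and only remove the file if it succeeded.

    On any exception the file is moved to `failed_dir` for later
    diagnosis and the original error is re-raised.
    """
    try:
        consume(path)
    except Exception:
        os.makedirs(failed_dir, exist_ok=True)
        shutil.move(path, os.path.join(failed_dir, os.path.basename(path)))
        raise
    else:
        os.remove(path)
```

The same pattern could move successfully imported files into a holding directory instead of removing them, covering the "keep for a day or two" suggestion.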

Elasticsearch/Lucene/Solr

While Postgres does have fantastic fulltext search, might I suggest using a purpose built search engine?

Feature request: Text context in search results

Another useful feature might be having a column in the data table that shows up when you do a search, and shows a little bit of context from the text contents on the front page (so you know which file to download). This would be useful particularly if your documents don't have particularly descriptive titles.
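
A minimal sketch of how such a context column could be computed from the stored text. `search_context` and its `width` parameter are hypothetical; a real implementation might instead lean on the database's or search engine's own highlighting.

```python
def search_context(text, query, width=40):
    """Return a short excerpt of `text` around the first match of `query`.

    The match is case-insensitive. An empty string is returned when the
    query does not occur in the text.
    """
    position = text.lower().find(query.lower())
    if position == -1:
        return ""
    start = max(0, position - width)
    end = min(len(text), position + len(query) + width)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix
```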

case of document.contents for document_retagger, etc.

Hi Daniel!

It appears that the document contents are stored in mixed case but all tags are stored and matched in lower case. Shouldn't the contents be stored as lower case?

I love this project!!! I'm moving all my PDF files into it. At the moment I only have about 1500 in there and things seem to be running okay. I wonder how many will make sqlite start to crack.

Hybridise PDFs with combined OCR'd text

OCRFeeder is an application built around various recognition backends. It does the page segmentation (finding sections containing text), feeds the segments to the OCR engine of choice and provides output options like combined PDF (a PDF containing the plain text below the original document), plain text, ODT and others.

It's written in Python, so it should be possible to integrate the segmentation and export code with paperless.

Deskew/Despeckle

Often times when scanning documents (especially in bulk) you run into situations where the document is slightly skewed and/or has speckles that throw off OCR.

Consumption directory /Users/me/PAPERLESS_CONSUMPTION does not exist

Hi!
I've got this set up using Vagrant. The web server runs, and the settings I've set in settings.py for user and password work. BUT, when I run /opt/paperless/src/manage.py document_consumer I get a warning that it can't find my PAPERLESS_CONSUMPTION_DIR. It is in my Home folder, I've set it to be readable and writable by all, etc. What am I doing wrong?? Thank you, this seems like awesome software.

Just some feedback

Hi @danielquinn, thank you for paperless, I like it 😄

Here is a bit of feedback !

About usage

Would it be possible to have a single command to launch both the admin interface and the document consumer at once? I wrote a shell script to do this but if I send one of the two in background then I can't manage the processes properly.

Some issues

A langdetect exception:

Traceback (most recent call last):
  File "./manage.py", line 18, in <module>
    execute_from_command_line(sys.argv)
  File "django/core/management/__init__.py", line 350, in execute_from_command_line
    utility.execute()
  File "django/core/management/__init__.py", line 342, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "django/core/management/base.py", line 348, in run_from_argv
    self.execute(*args, **cmd_options)
  File "django/core/management/base.py", line 399, in execute
    output = self.handle(*args, **options)
  File "paperless/src/documents/management/commands/document_consumer.py", line 68, in handle
    self.loop()
  File "paperless/src/documents/management/commands/document_consumer.py", line 98, in loop
    text = self._get_ocr(pngs)
  File "paperless/src/documents/management/commands/document_consumer.py", line 161, in _get_ocr
    guessed_language = langdetect.detect(raw_text)
  File "langdetect/detector_factory.py", line 130, in detect
    return detector.detect()
  File "langdetect/detector.py", line 135, in detect
    probabilities = self.get_probabilities()
  File "langdetect/detector.py", line 142, in get_probabilities
    self._detect_block()
  File "langdetect/detector.py", line 149, in _detect_block
    raise LangDetectException(ErrorCode.CantDetectError, 'No features in text.')
langdetect.lang_detect_exception.LangDetectException: No features in text.

Swapped sender and title:
I noticed that the sender (an author name for example, written in the first lines of a PDF), when found, was saved as Title and that the file name was saved as Sender!

That's it.

Screenshot/demo

I want to try this sometime soon after looking at the README, but a screenshot or a live demo would really help in getting an impression of what this looks like 😉

Document Categories

Thanks for putting together this tool! After a bit of tooling around, I've got a bare metal install to play with. I was wondering what your thoughts are on creating a new property for Documents for a Category, separate from the tags. That way if you want to separate Documents into "Mail", "Receipts", "Statements", you can do that and perhaps even have different behaviors per Category.

An example of something that I personally would like to see is that for all of my Mail that I scan, I scan the envelope first, then open it up and scan the contents as subsequent pages. It would be cool for the "Mail" category to show a thumbnail of the first page of the PDF (in my case it would be the envelope). For the other Categories, maybe allow the user to upload their own images or whatnot to show for them so that at a glance, the Documents can be quickly gleaned as to which Category they belong to (a briefcase or a lock or whatnot).

If you're not opposed to the idea, I could potentially help implement it, time permitting.

Awesome awesome project!

I totally need this.
Thanks!!!

Use a non-space delimiter for tag matching so that strings containing spaces can be matched

e.g. "credit card"
If I search for credit and card separately I get lots of matches that aren't relevant.
The same happens if I search for a company name made of common words, e.g. "Direct Line".

I had a look at regular expressions to match a string with spaces, such as the two above, in order. I couldn't find any obvious examples, and it seems a bit overkill. A comma , or pipe | might be a better separator.

Also related: I could only find the help text, and the fact that space delimitation is used, by looking at the code.

Make Logging Suck Less

The logging facility I introduced in #16 is crude and doesn't make use of Python's native logging. I did this because I find Python's native logging facility confusing, but I hear that it can do what I want.

So, here's what I think logging should do:

You should be able to call it at arbitrary times, giving each message a level like debug, error, warning, etc.
You should be able to watch the logging output in a stream, like tail -f /path/to/file
You should be able to do REST API calls to get something like "the last 5 messages of level >=info".

The problem for me is that I can't figure out how to solve logging and still allow for the second and third point above. Any advice on this is greatly appreciated.
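
For what it's worth, Python's standard `logging` module can cover all three points at once by attaching more than one handler to the same logger. The sketch below is one possible arrangement, not the project's implementation: `RecentMessagesHandler` and `build_logger` are hypothetical names, and the REST view that would call `last()` is left out.

```python
import logging
from collections import deque


class RecentMessagesHandler(logging.Handler):
    """Keep the most recent records in memory so an API view can query them."""

    def __init__(self, capacity=100):
        super().__init__()
        self.records = deque(maxlen=capacity)

    def emit(self, record):
        self.records.append(record)

    def last(self, count, min_level=logging.INFO):
        """Return up to `count` (a positive number) formatted messages
        at or above `min_level`, oldest first."""
        matching = [r for r in self.records if r.levelno >= min_level]
        return [self.format(r) for r in matching[-count:]]


def build_logger(name="paperless", logfile=None):
    """Create a logger with an in-memory handler and an optional file handler."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")

    recent = RecentMessagesHandler()
    recent.setFormatter(formatter)
    logger.addHandler(recent)

    if logfile:
        # A plain file can be followed with `tail -f`.
        file_handler = logging.FileHandler(logfile)
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)
    return logger, recent
```

Calls such as `logger.debug(...)` or `logger.error(...)` handle the first point, the file handler the second, and `recent.last(5, logging.INFO)` the third. An in-memory buffer is lost on restart and is not shared between processes, so a database-backed handler may be the better fit if the web server and the consumer run separately.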

convert: no images defined

On Mac, OS X Yosemite 10.10.3

Not sure if relevant, I installed imagemagick via brew and so likewise changed the path to the corresponding CONVERT_BINARY in settings.py to "/usr/local/bin/convert"... even after clearing cache, this is an example of the output from the document_consumer script:

Consuming /Users/miles/Admin/Scanned/IMG_0001.pdf
convert: no images defined `/tmp/paperless/2623688.png' @ error/convert.c/ConvertImageCommand/3241.
Traceback (most recent call last):
  File "./manage.py", line 18, in <module>
    execute_from_command_line(sys.argv)
  File "/Users/miles/Library/Python/3.5/lib/python/site-packages/django/core/management/__init__.py", line 350, in execute_from_command_line
    utility.execute()
  File "/Users/miles/Library/Python/3.5/lib/python/site-packages/django/core/management/__init__.py", line 342, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/Users/miles/Library/Python/3.5/lib/python/site-packages/django/core/management/base.py", line 348, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/Users/miles/Library/Python/3.5/lib/python/site-packages/django/core/management/base.py", line 399, in execute
    output = self.handle(*args, **options)
  File "/Users/miles/Admin/paperless/src/documents/management/commands/document_consumer.py", line 68, in handle
    self.loop()
  File "/Users/miles/Admin/paperless/src/documents/management/commands/document_consumer.py", line 98, in loop
    text = self._get_ocr(pngs)
  File "/Users/miles/Admin/paperless/src/documents/management/commands/document_consumer.py", line 161, in _get_ocr
    guessed_language = langdetect.detect(raw_text)
  File "/Users/miles/Library/Python/3.5/lib/python/site-packages/langdetect/detector_factory.py", line 130, in detect
    return detector.detect()
  File "/Users/miles/Library/Python/3.5/lib/python/site-packages/langdetect/detector.py", line 135, in detect
    probabilities = self.get_probabilities()
  File "/Users/miles/Library/Python/3.5/lib/python/site-packages/langdetect/detector.py", line 142, in get_probabilities
    self._detect_block()
  File "/Users/miles/Library/Python/3.5/lib/python/site-packages/langdetect/detector.py", line 149, in _detect_block
    raise LangDetectException(ErrorCode.CantDetectError, 'No features in text.')
langdetect.lang_detect_exception.LangDetectException: No features in text.

A cursory glance at the /tmp/paperless/ directory also confirms that there are no files there. Does that directory need particular permissions for the PNGs to be created?
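
Independently of why no PNGs were produced, the `LangDetectException` in the traceback above is raised because language detection was handed text with nothing to analyse. A hedged sketch of a guard follows; `guess_language` is a hypothetical helper that receives the detector as a parameter (for example `langdetect.detect`) so that it can be tested without the library installed.

```python
def guess_language(raw_text, detect, default=None):
    """Return the detected language, or `default` if detection is impossible.

    `detect` is any callable that maps text to a language code and
    raises an exception when it finds no usable features.
    """
    if not any(character.isalpha() for character in raw_text):
        # Empty or purely non-alphabetic OCR output: nothing to detect.
        return default
    try:
        return detect(raw_text)
    except Exception:
        # In real code, catching the library's specific exception
        # (LangDetectException in the traceback) would be preferable.
        return default
```

The caller can then fall back to the default OCR language when `None` comes back instead of crashing the consumer.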

PARSER_REGEX and PARSER_REGEX_TITLE should be case-insensitive.

It took me a little while to figure out why the consumer was ignoring my PDFs. I moved them directly from my Brother DSmobile 920-DW scanner, which by default uses the file extension .PDF, not .pdf.

Thus they fail this check in the document_consumer script:

        if not re.match(self.PARSER_REGEX_TITLE, pdf):
            continue
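
A minimal sketch of a case-insensitive check using `re.IGNORECASE`. The pattern below is a stand-in, since the project's actual `PARSER_REGEX_TITLE` is not shown here, and `is_consumable` is a hypothetical helper.

```python
import re

# Hypothetical pattern for illustration; not the project's real regex.
TITLE_REGEX = re.compile(r"^.*/(.*)\.(pdf|jpe?g|png)$", flags=re.IGNORECASE)


def is_consumable(path):
    """Return True if the file extension is accepted, regardless of case."""
    return TITLE_REGEX.match(path) is not None
```

Compiling the pattern with the flag keeps the call sites unchanged, so the existing `re.match(self.PARSER_REGEX_TITLE, pdf)` check would work as before if the constant were a compiled pattern.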