statementrenamer's People

Contributors

mkazin

statementrenamer's Issues

Extractor scaling issue

As the number of extractors (and readers) grows, the mean time to process a single document will increase, because each document has to be run through a growing number of match() tests.

Since match() is currently implemented with simple string compares, this is a very low priority task.
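
To make the cost concrete, here is a minimal sketch of a linear dispatch loop. The Extractor class, its marker attribute, and find_extractor() are hypothetical stand-ins, not the project's actual classes; the only thing taken from the issue is that match() is a simple string compare.

```python
class Extractor:
    """Hypothetical stand-in for the project's extractor classes."""

    def __init__(self, name, marker):
        self.name = name
        self.marker = marker

    def match(self, text):
        # Simple substring compare, as the issue describes.
        return self.marker in text


def find_extractor(extractors, text):
    """Return the first matching extractor and the number of match() calls made."""
    calls = 0
    for extractor in extractors:
        calls += 1
        if extractor.match(text):
            return extractor, calls
    return None, calls


extractors = [Extractor("rcn", "RCN"), Extractor("vanguard", "Vanguard")]
found, calls = find_extractor(extractors, "Vanguard quarterly statement")
```

In the worst case the number of match() calls equals the number of registered extractors, which is why the cost only matters once the list gets long or match() gets expensive.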

Fix RCN's disabled "test_start_of_month_statement" unit test

NOTE: I started working on this a while back; see the rcn_fix branch for an attempt that seems to have failed:
https://github.com/mkazin/StatementRenamer/tree/rcn_fix

This is going to need some bounds-checking, which may or may not need to be applied to all extractors.
It may need a better definition of the statement dates.
Have fun!

From text_rcn.py:

# TODO: this fails because we're parsing the statement date,
# rather than the statement period. Problem is the statement
# period doesn't include a year, which means we need to grab
# it from the statement date (and handle edge cases- ick)
# Let's try checking the PDF/data again (for various versions
# of this statement format) to see if there's some kind of
# halfway elegant solution to this extractor.
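
One possible approach to the year problem, as a rough sketch only: borrow the year from the statement date and step back a year whenever the result would land after it. The function name resolve_period() and the (month, day) tuple inputs are assumptions for illustration; the real extractor's data shapes may differ, and this assumes the period ends on or before the statement date and spans less than a year.

```python
from datetime import date


def resolve_period(statement_date, start_md, end_md):
    """Attach years to (month, day) period bounds using the statement date.

    Assumes the period ends on or before the statement date and spans
    less than one year. A Feb 29 bound in a non-leap year is not handled
    and would raise ValueError.
    """
    end = date(statement_date.year, *end_md)
    if end > statement_date:
        # The period ended in the previous calendar year.
        end = date(statement_date.year - 1, *end_md)
    start = date(end.year, *start_md)
    if start > end:
        # The period started in the year before it ended (e.g. Dec to Jan).
        start = date(end.year - 1, *start_md)
    return start, end
```

For example, a statement dated 2019-01-03 covering 12/01 to 12/31 resolves to a period entirely in 2018, which is the kind of start-of-month case the disabled test appears to target.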

Make output filenames user-configurable

Other users of this utility will have their own particular preference when it comes to file naming.
Some might even want to use a shared configuration file.
Some of them may not know how to code (or may not want to) just to get it working the way they like.
I'm thinking of something similar to standard date formatting.

There may be some use cases to consider. For example:

The Vanguard extractor supports Quarterly statements:

    def rename(self, extracted_data):
        return self.__class__.FILE_FORMAT.format(
            extracted_data.get_end_date().year,
            extracted_data.get_end_date().month // 3)
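
A minimal sketch of what a user-facing template could look like, using named fields with str.format(). The function render_filename() and the field names (year, month, day, quarter) are hypothetical, not part of the current code. Note that the quarter here is computed as (month - 1) // 3 + 1, which agrees with the Vanguard snippet above for quarter-end months but also gives 1 through 4 for any other month.

```python
from datetime import date


def render_filename(template, end_date):
    """Fill a user-supplied template with fields derived from end_date."""
    fields = {
        "year": end_date.year,
        "month": end_date.month,
        "day": end_date.day,
        # Calendar quarter, 1 through 4.
        "quarter": (end_date.month - 1) // 3 + 1,
    }
    return template.format(**fields)


print(render_filename("{year}-Q{quarter} Quarterly Statement.pdf", date(2018, 3, 31)))
# prints "2018-Q1 Quarterly Statement.pdf"
print(render_filename("{year}-{month:02d} Statement.pdf", date(2018, 3, 31)))
# prints "2018-03 Statement.pdf"
```

Templates like these could live in a shared configuration file, so users would not need to touch any extractor code.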

Determine if it makes sense to invest in concurrency

Larger folder operations seem to take a while. Can we do significantly better?

  1. Measure disk read time using hash-only
  2. Measure time to perform renaming

If the delta is significant, generate one or more issues to address optimizing.
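
The two measurements could be taken with something like the sketch below. hash_file() and time_per_file() are hypothetical helpers, and MD5 is an assumption based only on the 32-character hashes in the tool's output; swap in whatever the tool actually uses.

```python
import hashlib
import time
from pathlib import Path


def hash_file(path):
    """Read a file fully and return its MD5 hex digest."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def time_per_file(func, paths):
    """Return total wall-clock seconds spent calling func on each path."""
    start = time.perf_counter()
    for path in paths:
        func(path)
    return time.perf_counter() - start


# Example usage (the folder name is a placeholder):
# paths = sorted(Path("Input").glob("**/*.pdf"))
# hash_seconds = time_per_file(hash_file, paths)
```

Timing the full rename step with the same time_per_file() helper and subtracting hash_seconds gives the delta that would indicate whether concurrency is worth pursuing.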

Note to visitors, especially inexperienced developers: this is actually a very simple task and you should feel confident picking it up. Please contact me for guidance if interested.

Design Review

Following the refactor in #24, review the current design and create issues for possible improvements.

(PS- UML available here)

Leave one copy of each duplicate hash

For any duplicate hash found in the input files, leave at least one copy, ideally the one that is already named properly.

This should be a unit test in the now-testable Task/Action classes.

SIMULATION Delete statement (1).pdf (Found duplicate hash (64cd378db60b38a5c9f9d79f937c64d8) shared by [2017-Q3 Quarterly Statement.pdf].)
SIMULATION Delete statement (2).pdf (Duplicate hash: 313bbe96f833ae1863f3d95044e10f5c)
SIMULATION Delete statement (3).pdf (Found duplicate hash (25ca43eca67e9230ab684b0da9167157) shared by [2017-Q1 Quarterly Statement.pdf].)
SIMULATION Delete statement.pdf (Duplicate hash: d87465b7adf08e3cdb26e203c2877173)
SIMULATION Rename statement (5).pdf to 2018-Q1 Quarterly Statement.pdf
SIMULATION Delete 2017-Q1 Quarterly Statement.pdf (Duplicate hash: 25ca43eca67e9230ab684b0da9167157)
SIMULATION Delete 2017-Q3 Quarterly Statement.pdf (Duplicate hash: 64cd378db60b38a5c9f9d79f937c64d8)

I should have a copy of those destroyed files in my original test data folder.

Very embarrassing.
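
The intended behavior could be sketched as follows: group files by hash, pick one keeper per group (preferring a properly named file), and delete only the rest. plan_deletions() and its inputs are hypothetical and do not reflect the actual Task/Action classes; the hash values are shortened placeholders.

```python
from collections import defaultdict


def plan_deletions(hashes, properly_named):
    """Return filenames to delete, keeping exactly one file per hash.

    hashes: dict mapping filename -> content hash.
    properly_named: set of filenames that already match the target format.
    """
    by_hash = defaultdict(list)
    for name, digest in hashes.items():
        by_hash[digest].append(name)

    to_delete = []
    for names in by_hash.values():
        # Prefer a properly named file as the keeper; break ties by name.
        keeper = min(names, key=lambda n: (n not in properly_named, n))
        to_delete.extend(n for n in names if n != keeper)
    return sorted(to_delete)


hashes = {
    "statement (1).pdf": "hash-a",
    "2017-Q3 Quarterly Statement.pdf": "hash-a",
    "statement (3).pdf": "hash-b",
    "2017-Q1 Quarterly Statement.pdf": "hash-b",
}
named = {"2017-Q3 Quarterly Statement.pdf", "2017-Q1 Quarterly Statement.pdf"}
deletions = plan_deletions(hashes, named)
```

A unit test along these lines would assert that every hash still has exactly one surviving file after the planned deletions.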

Fix broken pipe when piping output from -E

This command:
$ python statement_renamer/ -E Input/Hanscom\ Statements/pdf.pdf | egrep -i "hanscom|hfcu"

yields the following error:

Binary file (standard input) matches
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
BrokenPipeError: [Errno 32] Broken pipe

Greater detail can be seen by piping to echo:
$ python statement_renamer/ -E Input/Hanscom\ Statements/pdf.pdf | echo | egrep -i "hanscom|hfcu"

Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "statement_renamer/__main__.py", line 3, in <module>
    main.main()
  File "statement_renamer/main.py", line 268, in main
    raise e
  File "statement_renamer/main.py", line 265, in main
    task.execute()
  File "statement_renamer/main.py", line 76, in execute
    self.determine_action_for_file(self.args.positional)
  File "statement_renamer/main.py", line 149, in determine_action_for_file
    print(contents)
BrokenPipeError: [Errno 32] Broken pipe
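
In both commands the reading side of the pipe goes away before the script finishes writing: echo never reads its standard input, and the "Binary file (standard input) matches" message suggests egrep likely stopped reading after its first match. A possible fix, sketched here along the lines of the pattern described in the Python documentation's note on SIGPIPE, is to catch BrokenPipeError around the print. The wrapper name print_contents() is hypothetical; in the real code this would go around the print(contents) call in determine_action_for_file.

```python
import os
import sys


def print_contents(contents):
    """Print contents, exiting quietly if the reader closed the pipe."""
    try:
        print(contents)
        # Flush now so a broken pipe surfaces inside this try block.
        sys.stdout.flush()
    except BrokenPipeError:
        # Python flushes stdout again at exit; point it at devnull so
        # that final flush does not raise a second BrokenPipeError.
        devnull = os.open(os.devnull, os.O_WRONLY)
        os.dup2(devnull, sys.stdout.fileno())
        sys.exit(1)
```

Separately, passing -a to egrep makes it treat the extracted text as text rather than binary, which should let it print the matching lines instead of the one-line binary notice.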
