mkazin / statementrenamer

Python framework to rename financial statements (or other documents)

License: Apache License 2.0

Python 100.00%

statementrenamer's Issues

Deploy Travis output to pypi

See documentation here: https://docs.travis-ci.com/user/deployment/pypi/

Refactor task.py and implement unit tests

Extractor scaling issue

As the number of extractors (and readers) grows, the mean time to process a single document will increase, because each document has to go through an increasing number of match() tests.

Since match() is currently implemented with simple string compares, this is a very low priority task.

Fix RCN's disabled "test_start_of_month_statement" unit test

NOTE: I started working on this a while back, see the rcn_fix branch, an attempt which seems to have failed:
https://github.com/mkazin/StatementRenamer/tree/rcn_fix

This is going to need some bounds-checking, which may or may not need to be applied to all extractors.
It may need a better definition of the statement dates.
Have fun!

From text_rcn.py:

# TODO: this fails because we're parsing the statement date,
# rather than the statement period. Problem is the statement
# period doesn't include a year, which means we need to grab
# it from the statement date (and handle edge cases- ick)
# Let's try checking the PDF/data again (for various versions
# of this statement format) to see if there's some kind of
# halfway elegant solution to this extractor.
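
One possible direction for the missing year, as a rough sketch only: give each month/day in the statement period the year that places it closest to the statement date. The names `nearest_date` and `resolve_period` are hypothetical and do not exist in text_rcn.py, and the sketch assumes both ends of the period fall within roughly six months of the statement date. A period that touches February 29 is not handled.

```python
from datetime import date


def nearest_date(month, day, anchor):
    """Return the date with this month/day that lies closest to `anchor`."""
    # Try the anchor's year and its two neighbours, then keep the nearest one.
    candidates = [date(anchor.year + offset, month, day) for offset in (-1, 0, 1)]
    return min(candidates, key=lambda d: abs((d - anchor).days))


def resolve_period(statement_date, start_md, end_md):
    """Attach years to (month, day) period bounds using the statement date."""
    start = nearest_date(start_md[0], start_md[1], statement_date)
    end = nearest_date(end_md[0], end_md[1], statement_date)
    return start, end


# A statement dated early January covering the previous December
print(resolve_period(date(2019, 1, 5), (12, 1), (12, 31)))
```

Picking the nearest year avoids assuming whether the period comes before or after the statement date, which is the part that still needs checking against the real PDFs.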

Make output filenames user-configurable

Other users of this utility will have their own particular preference when it comes to file naming.
Some might even want to use a shared configuration file.
Some may not know how to code (or may not want to) just to get it working the way they like.
I'm thinking of something similar to standard date formatting.

There may be some use cases to consider. For example:

The Vanguard extractor supports Quarterly statements:

    def rename(self, extracted_data):
        return self.__class__.FILE_FORMAT.format(
            extracted_data.get_end_date().year,
            extracted_data.get_end_date().month // 3)
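
To make the idea concrete, here is a minimal sketch of a user-supplied template, assuming Python's built-in format mini-language is an acceptable syntax. The function `build_filename` and the field names `start`, `end` and `quarter` are hypothetical, not part of the current code. Dates accept strftime directives inside the braces, but strftime has no quarter directive, so the quarter is exposed as a separate field.

```python
from datetime import date


def build_filename(template, start_date, end_date):
    """Fill a user-supplied filename template (field names are hypothetical)."""
    fields = {
        "start": start_date,
        "end": end_date,
        # Computed as (month - 1) // 3 + 1, which agrees with month // 3
        # only when the end date falls in the last month of a quarter.
        "quarter": (end_date.month - 1) // 3 + 1,
    }
    return template.format(**fields)


print(build_filename("{end:%Y}-Q{quarter} Quarterly Statement.pdf",
                     date(2018, 1, 1), date(2018, 3, 31)))
# 2018-Q1 Quarterly Statement.pdf
```

A template like this could live in a shared configuration file, so changing the naming scheme would not require editing an extractor.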

Determine if it makes sense to invest in concurrency

Larger folder operations seem to take a while. Can we do significantly better?

Measure disk read time using hash-only
Measure time to perform renaming

If the delta is significant, generate one or more issues to address optimizing.
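
The hash-only measurement could start from something like the sketch below. It is an assumption-laden starting point: `hash_file` and `time_hash_only` are hypothetical helpers rather than existing functions, and MD5 is assumed as the hash algorithm. Timing a full renaming run the same way gives the second number for the comparison.

```python
import hashlib
import time
from pathlib import Path


def hash_file(path, chunk_size=65536):
    """Hash one file in chunks so large PDFs are not loaded into memory at once."""
    digest = hashlib.md5()  # assumption: the algorithm used by the tool may differ
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def time_hash_only(folder):
    """Return (elapsed seconds, number of files) for one hashing pass over folder."""
    started = time.perf_counter()
    count = 0
    for path in Path(folder).rglob("*"):
        if path.is_file():
            hash_file(path)
            count += 1
    return time.perf_counter() - started, count
```

Repeated runs over the same folder will mostly hit the operating system's file cache, so the first run is the one that reflects actual disk read time.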

Note to visitors, especially inexperienced developers: this is actually a very simple task and you should feel confident to pick it up. Please contact me for guidance if interested.

Design Review

Following the refactor in #24, review the current design and create issues for possible improvements.

(PS- UML available here)

Update README.md

See the update_readme branch for WIP
Blocked by #11

Lambda Deployment using Zappa

See https://github.com/zappa/Zappa which provides Lambda deployment for serverless Python.

Refactor: extract Task from main.py, test it properly

This is a first step towards making a proper library out of this package.
The goal is to eliminate the main.py file.

Leave one copy of each duplicate hash

For any duplicate hash found in the input files, leave at least one copy, ideally the one already named properly.

This should be a unit test in the now-testable Task/Action classes.

SIMULATION Delete statement (1).pdf (Found duplicate hash (64cd378db60b38a5c9f9d79f937c64d8) shared by [2017-Q3 Quarterly Statement.pdf].)
SIMULATION Delete statement (2).pdf (Duplicate hash: 313bbe96f833ae1863f3d95044e10f5c)
SIMULATION Delete statement (3).pdf (Found duplicate hash (25ca43eca67e9230ab684b0da9167157) shared by [2017-Q1 Quarterly Statement.pdf].)
SIMULATION Delete statement.pdf (Duplicate hash: d87465b7adf08e3cdb26e203c2877173)
SIMULATION Rename statement (5).pdf to 2018-Q1 Quarterly Statement.pdf
SIMULATION Delete 2017-Q1 Quarterly Statement.pdf (Duplicate hash: 25ca43eca67e9230ab684b0da9167157)
SIMULATION Delete 2017-Q3 Quarterly Statement.pdf (Duplicate hash: 64cd378db60b38a5c9f9d79f937c64d8)

I should have a copy of those destroyed files in my original test data folder.

Very embarrassing.
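
The intended behaviour can be sketched as a pure function that is easy to unit test. Everything named here is hypothetical: `plan_duplicate_deletes` is not an existing method of the Task/Action classes, and the `PROPER_NAME` pattern is only a guess at what "named properly" means, based on the filenames in the log above.

```python
import re
from collections import defaultdict

# Hypothetical test for an already-renamed file, e.g. "2017-Q3 Quarterly Statement.pdf"
PROPER_NAME = re.compile(r"^\d{4}-")


def plan_duplicate_deletes(hashes_by_file):
    """Given {filename: hash}, return the files to delete, keeping one per hash."""
    groups = defaultdict(list)
    for name, file_hash in hashes_by_file.items():
        groups[file_hash].append(name)

    to_delete = []
    for names in groups.values():
        if len(names) < 2:
            continue  # unique hash, nothing to delete
        # Prefer a properly named file as the keeper (False sorts before True),
        # and break ties alphabetically so the result is deterministic.
        keeper = min(names, key=lambda n: (PROPER_NAME.match(n) is None, n))
        to_delete.extend(n for n in names if n != keeper)
    return sorted(to_delete)


files = {
    "statement (1).pdf": "64cd378db60b38a5c9f9d79f937c64d8",
    "2017-Q3 Quarterly Statement.pdf": "64cd378db60b38a5c9f9d79f937c64d8",
    "statement (3).pdf": "25ca43eca67e9230ab684b0da9167157",
    "2017-Q1 Quarterly Statement.pdf": "25ca43eca67e9230ab684b0da9167157",
}
print(plan_duplicate_deletes(files))
# ['statement (1).pdf', 'statement (3).pdf']
```

Choosing the keeper before building the delete list means every hash group keeps exactly one file, whichever order the files are visited in.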

Fix broken pipe when piping output from -E

This command:
$ python statement_renamer/ -E Input/Hanscom\ Statements/pdf.pdf | egrep -i "hanscom|hfcu"

yields the following error:

Binary file (standard input) matches
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
BrokenPipeError: [Errno 32] Broken pipe

Greater detail can be seen by piping to echo:
$ python statement_renamer/ -E Input/Hanscom\ Statements/pdf.pdf | echo | egrep -i "hanscom|hfcu"

Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "statement_renamer/__main__.py", line 3, in <module>
    main.main()
  File "statement_renamer/main.py", line 268, in main
    raise e
  File "statement_renamer/main.py", line 265, in main
    task.execute()
  File "statement_renamer/main.py", line 76, in execute
    self.determine_action_for_file(self.args.positional)
  File "statement_renamer/main.py", line 149, in determine_action_for_file
    print(contents)
BrokenPipeError: [Errno 32] Broken pipe
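
A possible fix, following the pattern described in the Python `signal` documentation's note on SIGPIPE: catch `BrokenPipeError` around the write, then point stdout at devnull so the flush at interpreter exit does not raise again. This is a sketch, not the project's code. `write_or_stop` and this `main` are hypothetical stand-ins for the `print(contents)` call in determine_action_for_file.

```python
import os
import sys


def write_or_stop(contents, stream):
    """Write contents to stream; return False if the reader has gone away."""
    try:
        stream.write(contents)
        # Flush inside the try block so a closed pipe is detected here
        # rather than at interpreter shutdown.
        stream.flush()
        return True
    except BrokenPipeError:
        return False


def main():
    # Placeholder text standing in for the extracted PDF contents.
    if not write_or_stop("extracted text\n", sys.stdout):
        # Python flushes standard streams on exit; redirect stdout to devnull
        # to avoid a second BrokenPipeError being reported at shutdown.
        devnull = os.open(os.devnull, os.O_WRONLY)
        os.dup2(devnull, sys.stdout.fileno())
        sys.exit(1)
```

Keeping the write in a small function that takes the stream as a parameter makes the closed-pipe case testable with a fake stream, without needing a real pipe.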

Fix test coverage in TravisCI

Refactor: extract directory walking from main.py, provide it as a utility method

This is the second step in eliminating main.py
Blocked by #3
