mkazin / statementrenamer
Python framework to rename financial statements (or other documents)
License: Apache License 2.0
See documentation here: https://docs.travis-ci.com/user/deployment/pypi/
As the number of extractors (and readers) grows, the mean time to process a single document will increase, because each document has to go through an increasing number of match() tests.
Since match() is currently implemented with simple string comparisons, this is a very low-priority task.
NOTE: I started working on this a while back; see the rcn_fix branch, an attempt which seems to have failed:
https://github.com/mkazin/StatementRenamer/tree/rcn_fix
This is going to need some bounds-checking, which may or may not need to be applied to all extractors.
It may need a better definition of the statement dates.
Have fun!
From text_rcn.py:
# TODO: this fails because we're parsing the statement date,
# rather than the statement period. Problem is the statement
# period doesn't include a year, which means we need to grab
# it from the statement date (and handle edge cases- ick)
# Let's try checking the PDF/data again (for various versions
# of this statement format) to see if there's some kind of
# halfway elegant solution to this extractor.
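One possible shape for the year handling described in that TODO, as a minimal sketch only. It assumes the extractor can already parse the statement date as a full date and the statement period as two (month, day) pairs, that the period ends on or before the statement date, and that it spans less than a year. The function name `resolve_period` is hypothetical and does not exist in the repository.

```python
from datetime import date


def resolve_period(statement_date, start_md, end_md):
    """Attach years to a (month, day) statement period.

    Hypothetical helper: assumes the period ends on or before
    statement_date and covers less than one year.
    """
    start_month, start_day = start_md
    end_month, end_day = end_md

    end_year = statement_date.year
    # A period ending Dec 31 on a statement dated in early January
    # belongs to the previous calendar year.
    if (end_month, end_day) > (statement_date.month, statement_date.day):
        end_year -= 1

    start_year = end_year
    # A period such as Dec 15 to Jan 14 starts in the year before it ends.
    if (start_month, start_day) > (end_month, end_day):
        start_year -= 1

    # date() raises ValueError for impossible dates (e.g. Feb 29 in a
    # non-leap year), which doubles as a basic bounds check.
    return (date(start_year, start_month, start_day),
            date(end_year, end_month, end_day))
```

For a statement dated 2018-01-20 with a period of 12/15 to 1/14, this returns 2017-12-15 and 2018-01-14. Whether these assumptions hold for every version of the RCN statement format would need to be checked against the actual PDFs.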
Other users of this utility will have their own particular preference when it comes to file naming.
Some might even want to use a shared configuration file.
Some of them may not know how to code (or may not want to) in order to get it working the way they like.
I'm thinking of something similar to standard date formatting.
There may be some use cases to consider. For example:
The Vanguard extractor supports Quarterly statements:
    def rename(self, extracted_data):
        return self.__class__.FILE_FORMAT.format(
            extracted_data.get_end_date().year,
            extracted_data.get_end_date().month // 3)
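A user-supplied naming template could look roughly like the sketch below, which leans on Python's own format mini-language so that date fields accept strftime-style specs. Everything here is an assumption for illustration: `build_filename` and the field names (`year`, `month`, `day`, `quarter`, `start`, `end`) are hypothetical and not part of the package. The quarter is computed as `(month - 1) // 3 + 1`, which differs from the `month // 3` in the Vanguard extractor: both agree for quarter-end months (3, 6, 9, 12), but this version also maps every other month to its calendar quarter.

```python
from datetime import date


def build_filename(template, start_date, end_date):
    """Fill a user-provided template from the statement dates.

    Hypothetical helper; field names are illustrative only.
    """
    fields = {
        "year": end_date.year,
        "month": end_date.month,
        "day": end_date.day,
        # Calendar quarter of the end date (1 to 4).
        "quarter": (end_date.month - 1) // 3 + 1,
        # Full dates, so templates can use specs like {end:%Y-%m-%d}.
        "start": start_date,
        "end": end_date,
    }
    return template.format(**fields)


print(build_filename("{year}-Q{quarter} Quarterly Statement.pdf",
                     date(2018, 1, 1), date(2018, 3, 31)))
print(build_filename("{end:%Y-%m} Statement.pdf",
                     date(2018, 3, 1), date(2018, 3, 31)))
```

A template string like this could live in a shared configuration file, so users who do not code can still change the naming scheme.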
Larger folder operations seem to take a while. Can we do significantly better?
If the delta is significant, generate one or more issues to address optimizing.
Note to visitors, especially inexperienced developers: this is actually a very simple task and you should feel confident to pick it up. Please contact me for guidance if interested.
See the update_readme branch for WIP
Blocked by #11
See https://github.com/zappa/Zappa which provides Lambda deployment for serverless Python.
This is a first step towards making a proper library out of this package.
The goal is to eliminate the main.py file.
For any duplicate hash found in the input files, leave at least one copy. Ideally, the one already named properly.
This should be a unit test in the now-testable Task/Action classes.
SIMULATION Delete statement (1).pdf (Found duplicate hash (64cd378db60b38a5c9f9d79f937c64d8) shared by [2017-Q3 Quarterly Statement.pdf].)
SIMULATION Delete statement (2).pdf (Duplicate hash: 313bbe96f833ae1863f3d95044e10f5c)
SIMULATION Delete statement (3).pdf (Found duplicate hash (25ca43eca67e9230ab684b0da9167157) shared by [2017-Q1 Quarterly Statement.pdf].)
SIMULATION Delete statement.pdf (Duplicate hash: d87465b7adf08e3cdb26e203c2877173)
SIMULATION Rename statement (5).pdf to 2018-Q1 Quarterly Statement.pdf
SIMULATION Delete 2017-Q1 Quarterly Statement.pdf (Duplicate hash: 25ca43eca67e9230ab684b0da9167157)
SIMULATION Delete 2017-Q3 Quarterly Statement.pdf (Duplicate hash: 64cd378db60b38a5c9f9d79f937c64d8)
I should have a copy of those destroyed files in my original test data folder.
Very embarrassing.
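The requirement above (keep at least one copy per hash, preferably the one already named properly) can be sketched as a pure function that is easy to unit test. This is only an illustration under stated assumptions: `plan_duplicate_deletions` and the `PROPER_NAME` pattern are hypothetical, the pattern simply guesses that a proper name starts with a year followed by a quarter or month, and the input is assumed to be a mapping from file name to content hash computed elsewhere.

```python
import re
from collections import defaultdict

# Hypothetical notion of "already named properly": starts with
# YYYY-Qn or YYYY-MM, as in "2017-Q3 Quarterly Statement.pdf".
PROPER_NAME = re.compile(r"^\d{4}-(Q[1-4]|\d{2})\b")


def plan_duplicate_deletions(hashes_by_name):
    """Return the file names that are safe to delete as duplicates.

    For every hash, exactly one file is kept. Properly named files
    sort first, so they are the ones that survive.
    """
    groups = defaultdict(list)
    for name, digest in hashes_by_name.items():
        groups[digest].append(name)

    to_delete = []
    for names in groups.values():
        # False sorts before True, so matching names come first.
        names.sort(key=lambda n: (PROPER_NAME.match(n) is None, n))
        to_delete.extend(names[1:])  # keep names[0], delete the rest
    return sorted(to_delete)
```

With the hashes from the simulation log, a file such as "statement (1).pdf" would be planned for deletion while "2017-Q3 Quarterly Statement.pdf" is kept, and a file whose hash appears only once is never deleted.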
This command:
$ python statement_renamer/ -E Input/Hanscom\ Statements/pdf.pdf | egrep -i "hanscom|hfcu"
yields the following error:
Binary file (standard input) matches
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
BrokenPipeError: [Errno 32] Broken pipe
Greater detail can be seen by piping to echo:
$ python statement_renamer/ -E Input/Hanscom\ Statements/pdf.pdf | echo | egrep -i "hanscom|hfcu"
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "statement_renamer/__main__.py", line 3, in <module>
    main.main()
  File "statement_renamer/main.py", line 268, in main
    raise e
  File "statement_renamer/main.py", line 265, in main
    task.execute()
  File "statement_renamer/main.py", line 76, in execute
    self.determine_action_for_file(self.args.positional)
  File "statement_renamer/main.py", line 149, in determine_action_for_file
    print(contents)
BrokenPipeError: [Errno 32] Broken pipe
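One common way to handle this, adapted from the SIGPIPE note in the Python `signal` documentation, is to catch `BrokenPipeError` around the write and then point stdout at the null device so the interpreter's flush at exit does not fail a second time. The sketch below is not the project's code: `emit` and `silence_stdout` are hypothetical names, and where exactly the handling belongs in main.py is left open.

```python
import os
import sys


def emit(contents, stream=None):
    """Write contents plus a newline; return False if the reader is gone."""
    stream = sys.stdout if stream is None else stream
    try:
        stream.write(contents + "\n")
        # Flush now so a closed pipe is detected inside this try block.
        stream.flush()
        return True
    except BrokenPipeError:
        return False


def silence_stdout():
    """Redirect stdout to the null device.

    Python flushes standard streams at exit; without this redirect the
    flush can raise BrokenPipeError again during shutdown.
    """
    devnull = os.open(os.devnull, os.O_WRONLY)
    os.dup2(devnull, sys.stdout.fileno())


# Possible use in place of print(contents):
#     if not emit(contents):
#         silence_stdout()
#         sys.exit(1)
```

Exiting quietly when the downstream command closes the pipe early is the conventional behaviour for command-line tools, so commands like `head` or `echo` on the receiving end would no longer produce a traceback.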
This is the second step in eliminating main.py
Blocked by #3