
pdf_statement_reader's People

Contributors

dependabot[bot], flywire, hyturtle, marlaneighty20, marlanperumal

pdf_statement_reader's Issues

Setting up a new config

Could you help me set up a new config for different statements? I have the example set up for ZA Absa and would like to try it with other statements.

Maybe add a quick guide to the README showing how each setting is configured, e.g. page layout.

Needs two date formats and reconcile one with the other

My bank documents have two dates in different formats: one for the date of the operation and one for the effective account change. The first one also lacks the year. Here is a snapshot:

image

So I need to be able to declare two different date formats and fill in the missing year from the other date field, using a heuristic that picks the year that puts the two dates closest together (e.g. for 30/12 and 02/01/2023, the first date is 30/12/2022 and not 30/12/2023).
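A minimal sketch of the closest-year heuristic described above, using only the standard library. The function name `infer_year` and the `%d/%m` default format are assumptions for illustration; they are not part of pdf_statement_reader.

```python
from datetime import date, datetime


def infer_year(day_month: str, reference: date, fmt: str = "%d/%m") -> date:
    """Return the date for day_month whose year puts it closest to reference."""
    # Parse against a leap year so that 29/02 is accepted, then swap the year.
    parsed = datetime.strptime(f"{day_month}/2000", f"{fmt}/%Y")
    candidates = []
    for year in (reference.year - 1, reference.year, reference.year + 1):
        try:
            candidates.append(date(year, parsed.month, parsed.day))
        except ValueError:
            # 29 February does not exist in this candidate year.
            continue
    return min(candidates, key=lambda d: abs((d - reference).days))


# The date that carries a year acts as the reference for the one that does not.
value_date = datetime.strptime("02/01/2023", "%d/%m/%Y").date()
print(infer_year("30/12", value_date))  # 2022-12-30
```

Trying the previous, same, and next year keeps the logic symmetric, so it also handles an operation date that falls just after the effective date across a year boundary.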

issue with test pdf file

Hi. I installed the library on Google Colab and ran the command below, but I got the following error.

!psr pdf2csv '/content/bank-statement.pdf'

How can I use it in Python code?
Error:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py", line 3802, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 165, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 5745, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 5753, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Debit Amount'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/bin/psr", line 8, in <module>
sys.exit(cli())
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pdf_statement_reader/__init__.py", line 80, in pdf2csv
df = parse_statement(input_filename, config)
File "/usr/local/lib/python3.10/dist-packages/pdf_statement_reader/parse.py", line 104, in parse_statement
clean_numeric(statement, config)
File "/usr/local/lib/python3.10/dist-packages/pdf_statement_reader/parse.py", line 50, in clean_numeric
df[col] = df[col].apply(format_negatives)
File "/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py", line 3807, in __getitem__
indexer = self.columns.get_loc(key)
File "/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py", line 3804, in get_loc
raise KeyError(key) from err
KeyError: 'Debit Amount'
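The traceback shows `clean_numeric` indexing a column named 'Debit Amount' that the extracted table does not have. One way to diagnose this is to compare the column names the config expects with the names actually extracted. The sketch below is a hypothetical helper, not part of pdf_statement_reader, and every column name other than 'Debit Amount' is made up for illustration. With pandas, `df.columns` could be passed as `actual`.

```python
def missing_columns(expected, actual):
    """Return the expected column names that are absent from actual, in order."""
    # Strip whitespace because extracted headers often carry stray spaces.
    actual_names = {str(name).strip() for name in actual}
    return [name for name in expected if name not in actual_names]


# Hypothetical example: the config expects three numeric columns,
# but the table extracted from the PDF has different headers.
expected = ["Debit Amount", "Credit Amount", "Balance"]
extracted = ["Date", "Transaction Description", " Balance "]
print(missing_columns(expected, extracted))  # ['Debit Amount', 'Credit Amount']
```

A non-empty result suggests the statement layout does not match the config being applied.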

Issue while Decrypting

I'm getting the following error while decrypting a file using
psr decrypt 20230601.pdf 20230611-decrypt.pdf

Traceback (most recent call last):
  File "C:\Program Files\Python38\lib\runpy.py", line 193, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\Python38\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\Arindam\OneDrive\Documents\Projects\StatementConverter\env\Scripts\psr.exe\__main__.py", line 9, in <module>
  File "c:\users\arindam\onedrive\documents\projects\statementconverter\env\lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "c:\users\arindam\onedrive\documents\projects\statementconverter\env\lib\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "c:\users\arindam\onedrive\documents\projects\statementconverter\env\lib\site-packages\click\core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "c:\users\arindam\onedrive\documents\projects\statementconverter\env\lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\users\arindam\onedrive\documents\projects\statementconverter\env\lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "c:\users\arindam\onedrive\documents\projects\statementconverter\env\lib\site-packages\pdf_statement_reader\__init__.py", line 63, in decrypt
    decrypt_pdf(input_filename, output_filename, password)
  File "c:\users\arindam\onedrive\documents\projects\statementconverter\env\lib\site-packages\pdf_statement_reader\decrypt.py", line 34, in decrypt_pdf
    pdf = Pdf.open(input_filename, password)

Use pipenv

It is unreasonable to assume that psr will run this program. Suggest installing into a Python virtual environment.

See #36

TypeError

TypeError: join() missing 1 required positional argument: 'path'

Extract Data between Start and End Strings

It seems to extract data from the area between coordinates on each page once the data columns are found, and retains the rows that have a balance. A better process for bank statements would be to extract between start and end strings:

  1. Search for Account Transactions (occurring on line before field names)
  2. Extract and save field names line
  3. Search for Opening Balance (occurring on first page before first transaction)
  4. Update to search for BALANCE BROUGHT FORWARD (occurring on each non-first page before first transaction)
  5. Read line
  6. If CLOSING BALANCE (occurring on last page after last transaction) end
  7. If BALANCE CARRIED FORWARD (occurring on each non-last page after last transaction) loop to 4 until BALANCE BROUGHT FORWARD found
  8. save line loop to 4
  9. end
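The steps above can be sketched as a small state machine over the text lines of a statement. This is only an illustration under assumptions: the function name `extract_transactions` is hypothetical, the input is assumed to be the lines of all pages already extracted as text in reading order, and the marker strings are matched case-insensitively.

```python
def extract_transactions(lines):
    """Return (header_line, transaction_lines) found between the marker strings."""
    header = None
    rows = []
    state = "find_title"
    for line in lines:
        text = line.strip()
        upper = text.upper()
        if state == "find_title":
            # Step 1: the title line precedes the field names.
            if "ACCOUNT TRANSACTIONS" in upper:
                state = "read_header"
        elif state == "read_header":
            # Step 2: the first non-empty line after the title holds the field names.
            if text:
                header = text
                state = "find_start"
        elif state == "find_start":
            # Steps 3 and 4: skip everything until a start marker appears.
            if "OPENING BALANCE" in upper or "BALANCE BROUGHT FORWARD" in upper:
                state = "in_table"
        elif state == "in_table":
            if "CLOSING BALANCE" in upper:
                break  # Step 6: end of the last page's transactions.
            if "BALANCE CARRIED FORWARD" in upper:
                state = "find_start"  # Step 7: wait for the next page's start marker.
            elif text:
                rows.append(text)  # Step 8: keep the transaction line.
    return header, rows


sample = [
    "Account Transactions",
    "Date Transaction Debit Credit Balance",
    "01 Jul 2018 OPENING BALANCE $1,384.89 CR",
    "02 Jul Transfer to another Bank NetBank 372.00 $1,012.89 CR",
    "BALANCE CARRIED FORWARD $1,012.89 CR",
    "Page 1 of 2",
    "BALANCE BROUGHT FORWARD $1,012.89 CR",
    "05 Jul Salary 500.00 $1,512.89 CR",
    "06 Jul 2018 CLOSING BALANCE $1,512.89 CR",
    "Important notice",
]
header, rows = extract_transactions(sample)
```

In this sketch the marker lines themselves are not kept, so page footers and bank notices outside the markers never reach the output.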

Sample CbaBankStatement.pdf and savings.json

Editable: CbaBankStatement2.docx

Parse Errors

Done at #33 (comment) (pdf at #30).

Date    Transaction                             Debit   Credit  Balance
01 Jul  2018 OPENING BALANCE                                    $1,384.89 CR
01 Jul  DEBIT INTEREST CHARGED on this account
        to June 30. 2018 is $0.11
02 Jul  Transfer to another Bank NetBank        372.00          $1,012.89 CR
        Rob Ubank Transfer

  1. Amount starts with, ends with, or contains CR/DB
  2. No year in date
  3. Skip lines that start with, end with, or contain Balance/Forward
  4. Concatenate wrapped Transaction description
  5. Allow currency sign
  6. Allow thousands separator
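A rough sketch of how some of the points above (CR/DB suffix, currency sign, thousands separator, wrapped descriptions) might be handled with regular expressions. The names `parse_amount` and `merge_wrapped` are hypothetical, and treating DB/DR as negative is an assumption about the statement's convention.

```python
import re
from decimal import Decimal

# Optional "$", digits with optional thousands separators, optional CR/DR/DB suffix.
AMOUNT_RE = re.compile(
    r"^\$?(?P<number>\d[\d,]*\.\d{2})\s*(?P<sign>CR|DR|DB)?$", re.IGNORECASE
)
# A transaction line is assumed to start with a "dd Mmm" date such as "02 Jul".
DATE_RE = re.compile(r"^\d{2} [A-Z][a-z]{2}\b")


def parse_amount(text):
    """Convert '$1,384.89 CR' or '372.00' to a Decimal (DB/DR assumed negative)."""
    match = AMOUNT_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not an amount: {text!r}")
    value = Decimal(match.group("number").replace(",", ""))
    sign = (match.group("sign") or "").upper()
    return -value if sign in ("DR", "DB") else value


def merge_wrapped(lines):
    """Append lines that do not start with a date to the previous line."""
    merged = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if DATE_RE.match(text) or not merged:
            merged.append(text)
        else:
            merged[-1] += " " + text
    return merged


sample = [
    "01 Jul  DEBIT INTEREST CHARGED on this account",
    "        to June 30. 2018 is $0.11",
    "02 Jul  Transfer to another Bank NetBank        372.00          $1,012.89 CR",
    "        Rob Ubank Transfer",
]
merged = merge_wrapped(sample)
print(len(merged))  # 2
print(parse_amount("$1,384.89 CR"))  # 1384.89
```

Using `Decimal` rather than `float` avoids rounding surprises when the amounts are later summed or reconciled against the balance column.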

Execute Default Use Case (za.absa.cheque)

Hello,
I have installed the software and obtained an Absa cheque template statement, but when I try to run the software I get the error below.
I understand how to create a config file for my bank's statement (the principle at least), but I cannot even run the default example.
Could anyone help?
Many thanks

Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3621, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Debit Amount'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/bin/psr", line 8, in <module>
sys.exit(cli())
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pdf_statement_reader/__init__.py", line 80, in pdf2csv
df = parse_statement(input_filename, config)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pdf_statement_reader/parse.py", line 104, in parse_statement
clean_numeric(statement, config)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pdf_statement_reader/parse.py", line 50, in clean_numeric
df[col] = df[col].apply(format_negatives)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/frame.py", line 3505, in __getitem__
indexer = self.columns.get_loc(key)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3623, in get_loc
raise KeyError(key) from err
KeyError: 'Debit Amount'

Statement Columns as Graphics

The Australian Citibank cheque account uses graphics rather than text for the statement column headings (i.e. they can't be selected the way the transactions can), so pdf_statement_reader can't detect the start of the columns. It makes some attempt at the first two columns, but it would be better if it used CLOSING BALANCE to detect the end of the transactions rather than picking up broken parts of bank notices.

image

The general layout is similar to CBA except dates are dd Mmm yyyy.
