mindee / mindee-api-python
Mindee API Helper Library for Python
Home Page: https://mindee.com
License: MIT License
Running the prediction.py script provided here, I ran into a division-by-zero error on a specific document.
Expected behavior:
Parse the document and provide json
Actual behavior:
The document is parsed and a prediction is returned, but the script crashes at the reconstruction step: reconstructed_total += tax.value + 100 * tax.value / tax.rate
Reproduces how often:
100% of the time on that specific file
latest mindee-api-python code available
Stack trace:
[REDACTED] factures TVA novembre 2006-page15.pdf float division by zero
Traceback (most recent call last):
File "prediction.py", line 13, in <module>
mindee_response = mindee_client.parse_invoice(test_file_path)
File "/home/pafer/.local/share/virtualenvs/mindee-wRBzjoc1/lib/python3.8/site-packages/mindee/__init__.py", line 180, in parse_invoice
return self._wrap_response(input_file, response, "invoice")
File "/home/pafer/.local/share/virtualenvs/mindee-wRBzjoc1/lib/python3.8/site-packages/mindee/__init__.py", line 93, in _wrap_response
return Response.format_response(dict_response, document_type, input_file)
File "/home/pafer/.local/share/virtualenvs/mindee-wRBzjoc1/lib/python3.8/site-packages/mindee/__init__.py", line 302, in format_response
Invoice(
File "/home/pafer/.local/share/virtualenvs/mindee-wRBzjoc1/lib/python3.8/site-packages/mindee/documents/invoice.py", line 91, in __init__
self._checklist()
File "/home/pafer/.local/share/virtualenvs/mindee-wRBzjoc1/lib/python3.8/site-packages/mindee/documents/invoice.py", line 192, in _checklist
"taxes_match_total_incl": self.__taxes_match_total_incl(),
File "/home/pafer/.local/share/virtualenvs/mindee-wRBzjoc1/lib/python3.8/site-packages/mindee/documents/invoice.py", line 212, in __taxes_match_total_incl
reconstructed_total += tax.value + 100 * tax.value / tax.rate
ZeroDivisionError: float division by zero
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "prediction.py", line 18, in <module>
print(e.with_traceback())
TypeError: with_traceback() takes exactly one argument (0 given)
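The traceback shows two separate problems: the reconstruction divides by tax.rate when the rate is zero, and the script calls with_traceback() without its required argument. Below is a minimal sketch of how both could be guarded. The Tax dataclass and both function names are hypothetical stand-ins, not the SDK's own classes, and this is not necessarily how the library fixed the issue.

```python
import traceback
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Tax:
    """Hypothetical stand-in for the SDK's tax field (rate expressed in percent)."""
    value: Optional[float]
    rate: Optional[float]


def reconstruct_total_incl(taxes: Iterable[Tax]) -> Optional[float]:
    """Rebuild the tax-inclusive total, or return None if any tax cannot be used."""
    total = 0.0
    for tax in taxes:
        # A missing or zero rate makes the base amount unrecoverable, so give up
        # on the consistency check instead of dividing by zero.
        if tax.value is None or not tax.rate:
            return None
        total += tax.value + 100 * tax.value / tax.rate
    return total


def describe_error(exc: BaseException) -> str:
    """Format an exception together with its traceback for printing."""
    return "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
```

Returning None lets the caller treat the check as "not verifiable" rather than as a failure; whether that matches the intended checklist semantics is an assumption.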
Sign packages for PyPI
pip install -r requirements.txt
Expected behavior:
Install without any errors
Actual behavior:
I get this error fitz/fitz_wrap.c:2754:10: fatal error: 'fitz.h' file not found
See https://asciinema.org/a/0bq80oBoozUsqBd7aAUQwbj6o
Reproduces how often:
All the time
macOS: 11.5.2 on M1 architecture
Python: 3.9.7
Pip: 21.3
The SDK cannot build valid HTTP URLs on Windows. URL generation appears to use Python's os.path helpers, which join with "/" on Unix systems but "\" on Windows.
Tested on version v1.2.3
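A short sketch of the difference, assuming the URL is assembled from a base and a few path segments. The build_url helper and the "products"/"invoice" segments are illustrative only and are not taken from the SDK. ntpath is the Windows flavour of os.path and can be imported on any platform, which makes the behaviour reproducible without a Windows machine.

```python
import ntpath


def build_url(base_url: str, *segments: str) -> str:
    """Join URL parts with forward slashes, regardless of the host operating system."""
    parts = [base_url.rstrip("/")]
    parts.extend(segment.strip("/") for segment in segments)
    return "/".join(parts)


# What os.path.join does on Windows: the separator is a backslash.
windows_style = ntpath.join("https://api.mindee.net/v1", "products", "invoice")

# Joining explicitly with "/" gives the same result on every platform.
portable = build_url("https://api.mindee.net/v1", "products", "invoice")
```

posixpath.join or urllib.parse.urljoin would also avoid the backslashes; the explicit join is used here only because its behaviour is the easiest to read.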
Just a small typo in the readme, I found the real value by looking through the code base.
This is in the readme:
from mindee import Client
mindee_client = Client(
expense_receipts_token="your_expense_receipts_api_token_here",
but the named parameter should be expense_receipt_token (receipts vs receipt).
Add instruction on contributing like preparing your environment with things like pip install -r requirements.dev.txt
...
When I prepare a path input, printing self.file_extension gives application/pdf, but the API returns "Invalid mimetype application/octet-stream", meaning my original MIME type seems to be overridden somewhere.
Add copyright headers in all files. See docTR repo as an example.
Add proper README badges: check docTR as a good example.
Hi,
I just got the error below on an API call; I had not seen it before:
File "/usr/local/lib/python3.12/site-packages/mindee/mindee_http/response_validation.py", line 67, in clean_request_json
and response_json["api_request"]["status_code"].isdigit()
AttributeError: 'int' object has no attribute 'isdigit'
This happened after calling a custom API built with the new API builder. Previous calls did not show this error. Please let me know if I can provide more information.
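The error suggests the status code can arrive either as a string or as an integer. A tolerant check might look like the sketch below; parse_status_code is a hypothetical helper, not a function of the library, and the actual fix in the SDK may differ.

```python
from typing import Any, Optional


def parse_status_code(raw: Any) -> Optional[int]:
    """Return the status code as an int, accepting either an int or a digit string."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(raw, bool):
        return None
    if isinstance(raw, int):
        return raw
    if isinstance(raw, str) and raw.strip().isdigit():
        return int(raw.strip())
    return None
```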
Since v1.2.2, a few PDF files are recognized as blank (zero pages).
The function check_if_document_is_empty in mindee/inputs.py does not check for PDF "paths" (only images and text) when deciding whether a page is blank. Some rare scanned PDFs have no "image" and are therefore considered empty, which makes inference impossible.
Parsing PDFs with blank pages (i.e. zero text inside) throws a ValueError.
Got a non graceful error message when trying to use the CLI for financial type of document
run ./mindee-cli.sh financial -i path file.pdf
Expected behavior:
Return information about the financial document I used.
Actual behavior:
Traceback (most recent call last):
File "/Users/fharper/.pyenv/versions/3.9.7/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/Users/fharper/.pyenv/versions/3.9.7/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/Users/fharper/Dropbox/Mac (3)/Documents/code/mindee/sdk-python/mindee/__main__.py", line 185, in <module>
call_endpoint(parse_args())
File "/Users/fharper/Dropbox/Mac (3)/Documents/code/mindee/sdk-python/mindee/__main__.py", line 68, in call_endpoint
client = _ots_client(args, info)
File "/Users/fharper/Dropbox/Mac (3)/Documents/code/mindee/sdk-python/mindee/__main__.py", line 44, in _ots_client
func = getattr(client, f"config_{args.product_name}")
AttributeError: 'Client' object has no attribute 'config_financial'
Reproduces how often:
Always
2.0.1
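The traceback ends in a bare getattr on the client, so an unknown product name surfaces as an AttributeError. One way to fail more gracefully is sketched below. The Client class here is a hypothetical stand-in with a single configurator, and configure_product is an illustrative name; neither reflects the real CLI code.

```python
class Client:
    """Hypothetical stand-in exposing one configurator per supported product."""

    def config_invoice(self) -> str:
        return "invoice configured"


def configure_product(client: Client, product_name: str) -> str:
    """Call client.config_<product_name>(), exiting with a readable message if absent."""
    func = getattr(client, f"config_{product_name}", None)
    if func is None:
        # List what is available so the user can correct the command line.
        supported = sorted(
            name[len("config_"):] for name in dir(client) if name.startswith("config_")
        )
        raise SystemExit(
            f"Unsupported product '{product_name}'. Supported: {', '.join(supported)}"
        )
    return func()
```

Raising SystemExit with a message prints it to stderr and exits with a non-zero status, which is usually what a CLI user expects instead of a traceback.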
The --no-raise-errors definition is displayed after a line break, probably because it is too long.
Run ./mindee-cli.sh -h
Expected behavior:
No line break
Actual behavior:
Reproduces how often:
Always
2.0.1
Summary:
My API project is facing a KeyError 'page_id' when using new custom endpoints with the Mindee API. This issue is not present with older endpoints, suggesting a potential problem with the newer custom endpoint configuration or the library's response handling.
Background:
Our application leverages Mindee's OCR capabilities to extract data from PDFs. While older endpoints operate as expected with library version 3.13, the integration with newly created custom endpoints leads to a KeyError, even after updating to version 4.0.1.
Issue Description:
Upon invoking client.parse with product.CustomV1 for new custom endpoints, a KeyError is thrown, indicating the absence of 'page_id' in the parsed results. This exception is traced back to the construction of ListFieldV1 within the Mindee library.
Logs:
Traceback (most recent call last):
File "C:\Users\PC\PycharmProjects\multiCotizadorAPI\utils\mindee\mindee.py", line 30, in analyze_pdf_with_mindee
result = mindee_client.parse(product.CustomV1, input_doc, endpoint=custom_endpoint)
File "C:\Users\PC\PycharmProjects\multiCotizadorAPI\venv\Lib\site-packages\mindee\client.py", line 114, in parse
return self._make_request(
File "C:\Users\PC\PycharmProjects\multiCotizadorAPI\venv\Lib\site-packages\mindee\client.py", line 339, in _make_request
return PredictResponse(product_class, dict_response)
File "C:\Users\PC\PycharmProjects\multiCotizadorAPI\venv\Lib\site-packages\mindee\parsing\common\predict_response.py", line 28, in __init__
self.document = Document(prediction_type, raw_response["document"])
File "C:\Users\PC\PycharmProjects\multiCotizadorAPI\venv\Lib\site-packages\mindee\parsing\common\document.py", line 47, in __init__
self.inference = prediction_type(raw_response["inference"])
File "C:\Users\PC\PycharmProjects\multiCotizadorAPI\venv\Lib\site-packages\mindee\product\custom\custom_v1.py", line 27, in __init__
self.prediction = CustomV1Document(raw_prediction["prediction"])
File "C:\Users\PC\PycharmProjects\multiCotizadorAPI\venv\Lib\site-packages\mindee\product\custom\custom_v1_document.py", line 29, in __init__
self.fields[field_name] = ListFieldV1(field_contents)
File "C:\Users\PC\PycharmProjects\multiCotizadorAPI\venv\Lib\site-packages\mindee\parsing\custom\list.py", line 45, in __init__
self.page_id = raw_prediction["page_id"]
KeyError: 'page_id'
Troubleshooting Steps:
Current Findings:
We request the Mindee team's input on this matter. Could you advise on any known issues with custom endpoint integration or suggest further steps we might take to debug this problem? Any assistance would be greatly appreciated to facilitate a swift resolution.
Thank you for your attention to this matter.
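The KeyError comes from indexing raw_prediction["page_id"] directly. A defensive read is sketched below, assuming the key may legitimately be absent from responses of newer custom endpoints; that assumption is not confirmed in the report, and read_page_id is a hypothetical helper rather than library code.

```python
from typing import Any, Dict, Optional


def read_page_id(raw_prediction: Dict[str, Any]) -> Optional[int]:
    """Return the page id when the response includes one, otherwise None."""
    # dict.get avoids the KeyError raised by raw_prediction["page_id"].
    return raw_prediction.get("page_id")
```

If a missing page_id actually indicates a malformed response, raising a descriptive error would be a better choice than silently returning None.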
Change API token reference to API Key term
The Mindee library raises MimeTypeError(f"Could not determine MIME type of '{self.filename}'") when used with tempfile.NamedTemporaryFile.
f = tempfile.NamedTemporaryFile()
# Some TMP file magic happens here
mindee_client = Client(api_key=self.config.MINDEE_API_KEY.get_secret_value())
input_doc = mindee_client.doc_from_bytes(f.read(), f.name)
The problem is caused by the mimetypes library, which relies on the file extension when guessing a file's MIME type. The temporary name provided by NamedTemporaryFile contains no file extension, so mimetypes cannot guess the MIME type.
There are two solutions:
1. Have the doc_from_bytes() function accept a MIME type passed in manually.
2. Use the magic library, based on libmagic. It can guess MIME types quite well using the binary signatures of different file types, but it creates an extra dependency for developers since it requires libmagic to be installed on the computer.
However, guessing the MIME type from the file extension is not an optimal way to determine the correct type, since users sometimes upload, for example, PNG files with a .jpg extension, which causes errors.
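To illustrate the trade-off described above without adding a libmagic dependency, here is a minimal sketch that checks a few well-known leading byte signatures first and only then falls back to the extension. The guess_mime function and the SIGNATURES table are hypothetical and cover just three formats; they are not part of the Mindee SDK. The last lines also show a caller-side workaround: giving the temporary file an explicit suffix.

```python
import mimetypes
import tempfile
from typing import Optional

# Leading bytes ("magic numbers") of a few common document formats.
SIGNATURES = {
    b"%PDF-": "application/pdf",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
}


def guess_mime(data: bytes, filename: str) -> Optional[str]:
    """Prefer the content signature; fall back to the file extension."""
    for signature, mime in SIGNATURES.items():
        if data.startswith(signature):
            return mime
    return mimetypes.guess_type(filename)[0]


# Caller-side workaround: give the temporary file an explicit suffix so that
# extension-based guessing has something to work with.
with tempfile.NamedTemporaryFile(suffix=".pdf") as tmp:
    tmp.write(b"%PDF-1.4\n")
    tmp.seek(0)
    detected = guess_mime(tmp.read(), tmp.name)
```

Checking the content first also handles the mislabeled-file case: a PNG saved with a .jpg extension is still reported as image/png.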
Find pymupdf alternative for PDF page splitting.
Add install from the source instructions
Add link to contributing Guideline and code of conduct in the README.md
The node SDK sends a User-Agent header (mindee-node/${sdkVersion} node/${process.version}) when making requests to the Mindee API (api.mindee.net/v1).
We should add the same header so that the Mindee API backend has this information.
Example:
User-Agent: mindee-python/v1.3.0 python/3.8.12
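A possible way to assemble such a header value in Python is sketched below. build_user_agent is a hypothetical helper, and the SDK version is passed in by the caller because how the package exposes its own version is not stated here.

```python
import platform


def build_user_agent(sdk_version: str) -> str:
    """Build a value shaped like the example: mindee-python/v<sdk> python/<runtime>."""
    return f"mindee-python/v{sdk_version} python/{platform.python_version()}"


# The resulting dict can be merged into the headers of any HTTP request.
headers = {"User-Agent": build_user_agent("1.3.0")}
```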
Hi,
it's me again. Thank you for fixing the http code error so quickly. Now I can see the actual error from the server but it's not very informative:
File "/usr/local/lib/python3.12/site-packages/mindee/client.py", line 397, in _get_queued_document
raise handle_error(
mindee.error.mindee_http_error.MindeeHTTPServerError: custom 500 HTTP error: None - None
My code and the api works fine with other documents but not with this particular one. Is there anything I can do to debug this?!
Cannot install due to failed cascading dependencies on Big Sur
pip install mindee on macOS Big Sur
Expected behavior:
Install mindee and its dependencies, especially matplotlib
Actual behavior:
Failed to compile numpy, a dependency of matplotlib, on an Intel Mac
Reproduces how often:
100%
I checked the issues on the matplotlib project, and a maintainer rejected the issue reported 3 days ago, saying basically "not my problem if you're using arm", except... it's not an ARM issue :/
In the passport parsing the passport.full_name.value prints first & last, but no middle name.
Expected behavior:
full_name should contain all values of name
First [middle] Last (where there can be multiple middle names).
Actual behavior:
First Last
Reproduces how often:
Tested with 2 passports.
Expected behavior:
I expect the headers to be forwarded to Mindee's API using the right format, e.g.: Authorization: "Token <my_token>"
Actual behavior:
The header is missing the Token part.
Reproduces how often:
All API calls
The currently used format is deprecated
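For the Authorization header issue above, the expected format can be sketched as follows. auth_headers is a hypothetical helper name; only the "Token <my_token>" shape comes from the report.

```python
from typing import Dict


def auth_headers(api_key: str) -> Dict[str, str]:
    """Return request headers using the 'Token <key>' authorization scheme."""
    return {"Authorization": f"Token {api_key}"}
```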