Comments (4)
In the batch prediction scenario, I think it would be better to have an option to choose between raising an exception and setting that record's result to NaN while continuing to predict the others.
The Java interface `o.j.e.Evaluator` only supports single-row prediction mode via `Evaluator#evaluate(Map)`. The Python interface builds its batch prediction mode `jpmml_evaluator.Evaluator.evaluateAll(DataFrame)` on top of it. The main benefit of the batch interface is to send all rows from Java to Python as a single call (instead of many calls, one call per row).

Now, this is actually a good idea: JPMML-Evaluator-Python should provide an option for configuring "what to do about an EvaluationException".
I can quickly think of two options:
- "return invalid" aka "as-is". Matches the current behaviour, where the Java exception is propagated to the top, and the evaluation is stopped at that location.
- "replace with NaN" aka "ignore". The Java component will catch a row-specific exception and replace the result for that row with `Double#NaN` (or some other user-specified constant?).

Also, in "return invalid" aka "as-is" mode, it should be possible to configure whether partial results can be returned or not. Suppose there is a batch of 10'000 rows, and the evaluation fails on row 8566 because of some data input error. I think it might make sense to return the leading 8565 results in that case.
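The two modes, plus the partial-results variant, can be sketched in plain Python. This is only an illustration of the proposed semantics, not the library's API: `evaluate_rows`, its `on_error`, `fill_value` and `partial` parameters, and the `EvaluationError` class are all hypothetical names, and a toy function stands in for the Java evaluator.

```python
import math
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


class EvaluationError(Exception):
    """Hypothetical stand-in for a row-specific Java EvaluationException."""


def evaluate_rows(
    rows: List[Row],
    evaluate_row: Callable[[Row], float],
    on_error: str = "as-is",
    fill_value: float = math.nan,
    partial: bool = False,
) -> List[float]:
    """Evaluate rows one by one under a configurable error policy (assumed names)."""
    if on_error not in ("as-is", "ignore"):
        raise ValueError(f"Unknown on_error policy: {on_error!r}")
    results: List[float] = []
    for row in rows:
        try:
            results.append(evaluate_row(row))
        except EvaluationError:
            if on_error == "ignore":
                # "replace with NaN": record a placeholder and keep going.
                results.append(fill_value)
            elif partial:
                # "as-is" with partial results: return the leading successes.
                return results
            else:
                # "as-is": propagate the exception and stop.
                raise
    return results


def toy_model(row: Row) -> float:
    """Toy evaluator that fails on a missing input value."""
    if row["x"] is None:
        raise EvaluationError("missing value for field x")
    return 2.0 * row["x"]


rows = [{"x": 1.0}, {"x": None}, {"x": 3.0}]
ignored = evaluate_rows(rows, toy_model, on_error="ignore")
leading = evaluate_rows(rows, toy_model, on_error="as-is", partial=True)
print(ignored)  # [2.0, nan, 6.0]
print(leading)  # [2.0]
```

In this sketch the partial-results variant silently drops the error; a real implementation would probably want to report the failing row alongside the leading results.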
Right, those are really friendly options. And these two options would be added on top of the current behavior, which is to just throw an exception, right? As you said, clear feedback is important, and so are these options.
I was also thinking that the "replace with NaN" option needs a threshold, or a specified number of rows, after which evaluation stops. In scenarios where people supply the wrong data, it would be a little annoying to still evaluate all of it.
What do you think?
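A minimal sketch of the threshold idea, assuming a hypothetical `max_errors` parameter: failed rows are replaced with NaN until the error count exceeds the budget, at which point evaluation stops. `ValueError` stands in for the row-specific evaluation exception, and `TooManyErrors` is an invented name.

```python
import math
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


class TooManyErrors(Exception):
    """Hypothetical exception raised once the error budget is exhausted."""


def evaluate_with_error_budget(
    rows: List[Row],
    evaluate_row: Callable[[Row], float],
    max_errors: int,
) -> List[float]:
    """Replace failed rows with NaN, but stop after more than max_errors failures."""
    results: List[float] = []
    error_count = 0
    for index, row in enumerate(rows):
        try:
            results.append(evaluate_row(row))
        except ValueError as error:  # stand-in for an evaluation exception
            error_count += 1
            if error_count > max_errors:
                raise TooManyErrors(
                    f"Stopped at row {index}: {error_count} errors exceed "
                    f"the budget of {max_errors}"
                ) from error
            results.append(math.nan)
    return results


def toy_model(row: Row) -> float:
    """Toy evaluator that rejects negative inputs."""
    if row["x"] < 0:
        raise ValueError("negative input")
    return row["x"] + 1.0


mostly_good = [{"x": 1.0}, {"x": -1.0}, {"x": 2.0}]
mostly_bad = [{"x": -1.0}, {"x": -2.0}, {"x": -3.0}]

tolerated = evaluate_with_error_budget(mostly_good, toy_model, max_errors=1)
print(tolerated)  # [2.0, nan, 3.0]

try:
    evaluate_with_error_budget(mostly_bad, toy_model, max_errors=1)
    stopped_early = False
except TooManyErrors:
    stopped_early = True
print(stopped_early)  # True
```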
from jpmml-evaluator-python.
There is a third option - "omit row" aka "drop". If there are evaluation errors, then the corresponding rows are simply omitted from the results batch.
The "omit row" option assumes that the user has assigned custom identifiers to the rows of the arguments batch. So, if there are 156 argument rows, and only 144 result rows (meaning that 12 rows errored out), then the user can locally identify "successful" vs "failed" rows in her application code.
See #23 about row identifiers.
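How the user could tell "successful" from "failed" rows under the "omit row" option can be sketched as follows. Plain dicts keyed by a user-assigned row identifier stand in for the arguments and results batches; the function name and the use of `ValueError` as the evaluation exception are assumptions for illustration only.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def evaluate_omitting_failures(
    rows_by_id: Dict[str, Row],
    evaluate_row: Callable[[Row], float],
) -> Dict[str, float]:
    """Return results keyed by row identifier, omitting rows that errored out."""
    results: Dict[str, float] = {}
    for row_id, row in rows_by_id.items():
        try:
            results[row_id] = evaluate_row(row)
        except ValueError:  # stand-in for an evaluation exception
            continue  # "omit row": the failed row leaves no trace in the results
    return results


def toy_model(row: Row) -> float:
    """Toy evaluator that fails on a missing input value."""
    if row["x"] is None:
        raise ValueError("missing value")
    return 10.0 * row["x"]


arguments = {"a": {"x": 1.0}, "b": {"x": None}, "c": {"x": 3.0}}
results = evaluate_omitting_failures(arguments, toy_model)

# The caller recovers the failed rows by comparing identifiers.
failed_ids: List[str] = [row_id for row_id in arguments if row_id not in results]
print(results)     # {'a': 10.0, 'c': 30.0}
print(failed_ids)  # ['b']
```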
As a general comment - my "design assumption" behind the `Evaluator.evaluateAll(X)` method is that the size of the arguments dataframe is about/up to 10'000 cells (eg. a dataframe of 10 features x 1000 rows).
My thinking is that the data is being moved between Python and Java environments using the Pickle protocol. If the pickle payload gets really big (say, 1'000'000 cells instead of 10'000 cells), then the Java component responsible for loading/dumping might start hitting unexpected memory/processing limitations.
If the dataset is much bigger than 10'000 cells, then it should be partitioned into multiple chunks in Python application code. And the chunking algorithm should be prepared to handle the "omit row" option gracefully.
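A rough sketch of such application-level chunking, assuming a budget of 10'000 cells per chunk and results keyed by row identifier so that omitted rows do not shift anything. The names `iter_chunks`, `evaluate_in_chunks` and `max_cells` are hypothetical, and a toy batch function stands in for the real batch call.

```python
from typing import Any, Callable, Dict, Iterator

Row = Dict[str, Any]
Batch = Dict[str, Row]  # row identifier -> row


def iter_chunks(batch: Batch, max_cells: int = 10_000) -> Iterator[Batch]:
    """Split a batch so that each chunk holds at most about max_cells cells."""
    items = list(batch.items())
    if not items:
        return
    n_features = max(1, len(items[0][1]))
    rows_per_chunk = max(1, max_cells // n_features)
    for start in range(0, len(items), rows_per_chunk):
        yield dict(items[start:start + rows_per_chunk])


def evaluate_in_chunks(
    batch: Batch,
    evaluate_batch: Callable[[Batch], Dict[str, float]],
    max_cells: int = 10_000,
) -> Dict[str, float]:
    """Evaluate chunk by chunk and merge results by row identifier."""
    results: Dict[str, float] = {}
    for chunk in iter_chunks(batch, max_cells):
        # A chunk may return fewer rows than it received ("omit row").
        results.update(evaluate_batch(chunk))
    return results


def toy_batch_model(chunk: Batch) -> Dict[str, float]:
    """Toy batch evaluator that omits rows containing a missing value."""
    return {
        row_id: float(sum(row.values()))
        for row_id, row in chunk.items()
        if all(value is not None for value in row.values())
    }


# 2500 rows x 10 features = 25'000 cells, above the assumed budget.
batch = {f"row-{i}": {f"x{j}": i for j in range(10)} for i in range(2500)}
batch["row-7"]["x0"] = None  # this row will be omitted

chunk_sizes = [len(chunk) for chunk in iter_chunks(batch)]
results = evaluate_in_chunks(batch, toy_batch_model)
print(chunk_sizes)   # [1000, 1000, 500]
print(len(results))  # 2499
```

Merging by identifier rather than by position is what keeps the chunking logic indifferent to how many rows each chunk drops.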
> my "design assumption" behind the `Evaluator.evaluateAll(X)` method is that the size of the arguments dataframe is about/up to 10'000 cells

The `Evaluator.evaluateAll(X)` method should have an extra parameter for controlling the batch size. The default would be my design assumption - about 10'000 cells. But the end user can increase or decrease its value if needed.
This way, the chunking logic would be nicely available at the JPMML-Evaluator-Python library level, leaving the actual Python application code clean.