There are three main, independent issues that explain this behavior, and each would require its own solution for the CharTokenizer ONNX export to produce the same outputs from ML.NET, OnnxRunner, and ORT.
Offset between ML.NET columns and OnnxRunner/ORT columns
When using the TokenizingByCharactersTransformer on ML.NET (without NimbusML), the output of the transformer is of type `Vector<Key<UInt16>>` (i.e. a Vector of `KeyDataViewType`s that have `UInt16` as `RawType`).
Because the outputs are Keys, this is affected by the same issue found in #428: the output for ML.NET's Key columns in NimbusML is 0-based, whereas the output of the OnnxRunner and ORT is 1-based. This offset appears to be caused by NimbusML automatically subtracting "1" from `KeyDataViewType` values somewhere in the code. In this issue, the behavior can be clearly seen on row 7 of the output.
Fixing #428 should fix this issue as well. As discussed offline, it seems that solving that issue would require major changes in how NimbusML works with Categorical pandas columns and ML.NET's `KeyDataViewType`s, so this is a major blocker for fixing these issues.
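The off-by-one can be sketched with a minimal example. The key values below are hypothetical, chosen only to illustrate the relationship between the two outputs:

```python
import numpy as np

# Hypothetical key values for illustration: ML.NET Key types are 1-based,
# and OnnxRunner/ORT preserve that, while NimbusML subtracts 1 before
# exposing the column to Python.
ort_keys = np.array([3, 1, 7, 2], dtype=np.uint16)  # 1-based, as ORT returns them
nimbusml_keys = ort_keys - 1                        # 0-based, as NimbusML returns them

print(nimbusml_keys.tolist())  # [2, 0, 6, 1]
```

The same underlying Key column is therefore shifted by exactly one between the two code paths, which is what the row-7 comparison shows.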
NaN vs. 65535 (float vs. uint16)
The TokenizingByCharactersTransformer maps the char `\uffff` to 65535; this character is often regarded as "not a character" in UTF-16 (the encoding the tokenizer uses). The exported ONNX model does the exact same mapping. It seems to me that where there's no character to map in any given `SentimentText_Transform` column, it will actually map a `\uffff`, and that's why we get 65535 in the ORT output for those columns.
But why do we get NaN in the other outputs? This has to do with PR #267, "Add variable length vector support". What is relevant from that PR here is that NimbusML will take a variable-length uint16 vector column, cast it to float, and add NaNs where values are "missing". The exact mechanism of that PR isn't clear to me, but after experimenting with the code it introduced (particularly in `PythonInterop.h` and `PythonInterop.cpp`), it seems clear that this issue is related to it. It could be that the last columns of any variable-length uint vector are simply filled with NaNs without applying the tokenizer to those columns, or it could be that the tokenizer is actually applied and the uint16 65535's are then mapped to float NaN's.
Since the output of ORT doesn't involve NimbusML, it neither makes the described casts nor fills missing values with NaNs; that's why its output is uint16 and contains the 65535's.
On the other hand, since ML.NET on its own has no problem working with variable-length vector columns, this issue isn't reproducible using only ML.NET (without NimbusML).
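The two observed outputs can be reproduced side by side in a small sketch. The rows below are hypothetical tokenizer outputs; the padding and casting steps mimic the described behavior, not NimbusML's actual implementation:

```python
import numpy as np

# U+FFFF ("not a character" in UTF-16) maps to its code point, 65535.
assert ord('\uffff') == 65535

# Hypothetical variable-length tokenizer output, e.g. for "Hi" and "Hi!".
rows = [[72, 105], [72, 105, 33]]

# ORT-style output: short rows end up holding the \uffff mapping (65535),
# and the dtype stays uint16.
width = max(len(r) for r in rows)
ort = np.array([r + [65535] * (width - len(r)) for r in rows], dtype=np.uint16)

# NimbusML-style output (per PR #267): cast to float and mark the missing
# trailing values as NaN.
nimbus = ort.astype(np.float64)
nimbus[ort == 65535] = np.nan

print(ort)     # last column of the short row is 65535
print(nimbus)  # same position is nan
```

Whether NimbusML pads with NaN before tokenizing or converts 65535 afterwards, the end result matches this sketch.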
Fixing this issue might require changing or reverting PR #267 (which might bring its own set of problems, given that its behavior was introduced for a reason), or modifying ML.NET's TokenizingByCharactersTransformer so that it outputs floats instead of uint16's. Either way, further discussion on this topic would be needed.
float64 vs. float32
The output from ML.NET is float64 whereas the output of OnnxRunner is float32. Again, this is related to PR #267.
Without NimbusML, the output of the TokenizingByCharactersTransformer in ML.NET is of type `Vector<Key<UInt16>>`, but the output of applying the exported ONNX model (without NimbusML) is of type `Vector<UInt16>`. This difference is somehow related to the fact that NimbusML casts the first case to float64 and the second to float32. The code that determines how the mappings of the variable-length vectors are handled is the following:
NimbusML/src/NativeBridge/PythonInterop.cpp, lines 47 to 64 in 1b7c399
There it says that `UInt16` (i.e. `unsigned short`, or `U2`) should be mapped to float32 (i.e. having `float` as a dtype, whereas `double` would be float64).
There doesn't seem to be any clear indication to treat `Key<UInt16>` differently from `UInt16`. But in the code referenced below, it seems that when a DataView is sent from ML.NET to Python, if there's a `KeyDataViewType` with `RawType` U2, then it gets cast to I4. Assuming this also holds for Key vectors, that would explain why ML.NET's output is float64 (i.e. double).
NimbusML/src/DotNetBridge/NativeDataInterop.cs, lines 141 to 151 in 1b7c399
The exact mechanisms behind all of the above casts would need further investigation. But perhaps this type mismatch between float64 and float32 isn't a blocking issue, and it wouldn't need to be fixed.
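The two cast chains can be summarized with NumPy dtypes. This is a sketch of the hypothesis above (U2 widened to I4 for Key columns, then promoted to float64), not a trace of NimbusML's actual code path:

```python
import numpy as np

# Plain Vector<UInt16> (the exported ONNX model's output):
# U2 maps straight to float32, per PythonInterop.cpp.
onnx_out = np.array([72, 105], dtype=np.uint16).astype(np.float32)

# Vector<Key<UInt16>> (ML.NET's output), under the hypothesis above:
# the U2 key is first widened to I4 in NativeDataInterop.cs, and the
# int32 column is then promoted to float64 on the Python side.
mlnet_out = np.array([72, 105], dtype=np.uint16).astype(np.int32).astype(np.float64)

print(onnx_out.dtype)   # float32
print(mlnet_out.dtype)  # float64
```

If that hypothesis holds, the float64 vs. float32 mismatch is just the visible end of the U2-vs-I4 difference in the bridge.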