Comments (6)
from onnx.
mlprodict is archived because its main functionality, a Python runtime for ONNX called OnnxInference, was moved into the onnx package as ReferenceEvaluator. I did not write an example about the trick I put in place to avoid ambiguities, but it is maintained in the unit tests: https://github.com/onnx/sklearn-onnx/blob/main/tests/test_sklearn_text.py#L11. This issue is not mentioned in the onnx documentation because the ambiguity comes from scikit-learn, not from onnx. The case usually doesn't happen unless the n-grams contain spaces; when they do, the converter gets confused.
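To illustrate the ambiguity mentioned above (this is a sketch with plain scikit-learn, not the linked unit test): scikit-learn stores n-grams as space-joined strings in `vocabulary_`, so a converter reading the vocabulary cannot tell a bigram from a single token that happens to contain a space.

```python
# Sketch: space-joined n-grams are ambiguous in vocabulary_.
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["first document", "second document"]

# With ngram_range=(1, 2), the bigram "first document" is stored in the
# vocabulary as one space-joined string.
bigrams = CountVectorizer(ngram_range=(1, 2)).fit(corpus)
print(sorted(bigrams.vocabulary_))

# A custom analyzer that keeps spaces can produce the *same* key for a
# single token, e.g. one that yields the whole document as one token.
unigrams = CountVectorizer(analyzer=lambda doc: [doc]).fit(corpus)
print(sorted(unigrams.vocabulary_))

# Both vocabularies contain the identical key "first document", so the
# converter cannot distinguish the two cases from the vocabulary alone.
assert "first document" in bigrams.vocabulary_
assert "first document" in unigrams.vocabulary_
```

The TraceableCountVectorizer / TraceableTfidfVectorizer wrappers used later in this thread resolve this by keeping n-grams as tuples of tokens rather than joined strings.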
Thank you for the quick reply! Using the test cases as a model, I wrote a sample program, but I was not having success when using any of the min_df, max_df, or max_features parameters. It looks like convert_sklearn_text_vectorizer expects stop_words_ to be a set of strings, but when min_df et al. are set, it becomes a set of tuples, so running to_onnx on the model fails with an error like this:
TypeError: One stop word is not a string ('is', 'this') in stop_words={('is', 'this'), ...
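For context (a sketch with stock scikit-learn, not the Traceable wrapper): when max_features, min_df, or max_df prunes the vocabulary, the pruned terms land in stop_words_ as space-joined strings, which is the shape convert_sklearn_text_vectorizer appears to expect. The error above arises because the Traceable* wrappers store them as tuples of tokens instead.

```python
# Sketch: in stock scikit-learn, pruned terms in stop_words_ are plain
# strings (space-joined n-grams), not tuples.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
]
vect = TfidfVectorizer(ngram_range=(1, 2), max_features=4).fit(corpus)

# max_features=4 prunes most of the unigrams and bigrams into stop_words_.
print(vect.stop_words_)
assert len(vect.stop_words_) > 0
assert all(isinstance(w, str) for w in vect.stop_words_)
```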
I've sort of fixed my problem by setting stop_words_ to an empty set and moving on. However, I'm still seeing some discrepancies and tracked them down to what looks like single-character tokens not being skipped by the ONNX model.
To reproduce:
import numpy as np
import skl2onnx
from skl2onnx.sklapi import TraceableTfidfVectorizer, TraceableCountVectorizer
from skl2onnx import update_registered_converter
from skl2onnx.shape_calculators.text_vectorizer import (
    calculate_sklearn_text_vectorizer_output_shapes,
)
from skl2onnx.operator_converters.text_vectoriser import convert_sklearn_text_vectorizer
from skl2onnx.operator_converters.tfidf_vectoriser import convert_sklearn_tfidf_vectoriser
from skl2onnx.common.data_types import StringTensorType
from numpy.testing import assert_almost_equal
from onnxruntime import InferenceSession

update_registered_converter(
    TraceableCountVectorizer,
    "Skl2onnxTraceableCountVectorizer",
    calculate_sklearn_text_vectorizer_output_shapes,
    convert_sklearn_text_vectorizer,
    options={
        "tokenexp": None,
        "separators": None,
        "nan": [True, False],
        "keep_empty_string": [True, False],
        "locale": None,
    },
)

update_registered_converter(
    TraceableTfidfVectorizer,
    "Skl2onnxTraceableTfidfVectorizer",
    calculate_sklearn_text_vectorizer_output_shapes,
    convert_sklearn_tfidf_vectoriser,
    options={
        "tokenexp": None,
        "separators": None,
        "nan": [True, False],
        "keep_empty_string": [True, False],
        "locale": None,
    },
)

corpus = np.array([
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
    "first Z document",  # with this line present, the last assertion fails because ONNX still sees the "Z" token when it shouldn't?
    "",
])

FEATURES = 12
vect = TraceableTfidfVectorizer(ngram_range=(1, 2), max_features=FEATURES)
vect.fit(corpus)
print("vocabulary:", vect.vocabulary_)
print("stop_words:", vect.stop_words_)

A = vect.transform(corpus).todense()
vect.stop_words_ = set()  # should be safe, right?
B = vect.transform(corpus).todense()
assert_almost_equal(A, B)  # make sure the vectorizer still does what we expect

model_onnx = skl2onnx.to_onnx(
    vect,
    "TfidfVectorizer",
    initial_types=[("input", StringTensorType([None, 1]))],
    target_opset=12,
)
sess = InferenceSession(model_onnx.SerializeToString(), providers=["CPUExecutionProvider"])
inputs = {"input": corpus.reshape((-1, 1))}
out = sess.run(None, inputs)[0]
assert out.shape == (len(corpus), FEATURES)
assert_almost_equal(out.ravel(), A.A1)
scikit-learn gives the same vector for both strings, as expected:
vect.transform(["first document", "first Z document"]).todense()
matrix([[0. , 0.51822427, 0.60474937, 0.60474937, 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. ],
[0. , 0.51822427, 0.60474937, 0.60474937, 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. ]])
And the model from onnx only matches the result on the first string:
sess.run(None, {'input': [["first document"], ["first Z document"]]})[0]
array([[0. , 0.5182243 , 0.6047494 , 0.6047494 , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. ],
[0. , 0.65069556, 0.7593387 , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. ]], dtype=float32)
Hi, I haven't understood the details here. ONNX is a specification; the issue could be either on the conversion side or a bug in the onnxruntime implementation. If it is the latter, I would recommend filing an issue in the onnxruntime repo. Do you know which it is?
Xavier might have a better idea (he is off this week).
Nope, I don't know. I don't have a full picture of all the related projects, their ownership, and the relations between them. Thanks for the link though; I thought I had found a related issue reported there, but I don't think that's it.
I was able to resolve my issue by explicitly passing "tokenexp": r"\b\w\w+\b" in the options. I don't think this substitution is correct:
https://github.com/onnx/sklearn-onnx/blob/d2029c1a9752f62a63fc5c4447b4d9fe75e8fe39/skl2onnx/operator_converters/text_vectoriser.py#L238
It should be something like [a-zA-Z0-9_]{2,} or \b\w\w+\b. Is the reason for this replacement that the ONNX runtime chokes on the (?u) in scikit-learn's default token_pattern?
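A quick stdlib check (not from the thread, just an illustration) that the two patterns suggested above tokenize like scikit-learn's default token_pattern r"(?u)\b\w\w+\b": all three keep word tokens of two or more characters and drop single characters such as "Z".

```python
# Compare scikit-learn's default token_pattern with the two suggested
# replacements on the string that exposed the discrepancy.
import re

text = "first Z document"
sklearn_default = re.findall(r"(?u)\b\w\w+\b", text)
alt1 = re.findall(r"\b\w\w+\b", text)            # (?u) is redundant on Python 3 str
alt2 = re.findall(r"[a-zA-Z0-9_]{2,}", text)     # ASCII-only approximation of \w\w+

print(sklearn_default)  # ['first', 'document']  -- "Z" is dropped
assert sklearn_default == alt1 == alt2 == ["first", "document"]
```

Note that `[a-zA-Z0-9_]{2,}` only matches ASCII word characters, so it is an approximation: it diverges from `\w\w+` on non-ASCII text.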