Comments (6)
from onnx.
mlprodict is archived because its main functionality, a Python runtime for ONNX called OnnxInference, was moved into the onnx package as ReferenceEvaluator. I did not write an example about the trick I put in place to avoid ambiguities, but it is maintained in the unit tests: https://github.com/onnx/sklearn-onnx/blob/main/tests/test_sklearn_text.py#L11. This issue is not mentioned in the onnx documentation because the ambiguity comes from scikit-learn, not from onnx. The case usually doesn't happen unless the n-grams contain spaces; when they do, the converter gets confused.
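To illustrate the ambiguity mentioned above (this is a sketch with plain scikit-learn, not the linked unit test): scikit-learn stores n-grams as space-joined strings in `vocabulary_`, so a converter reading the vocabulary cannot tell a bigram from a single token that happens to contain a space.

```python
# Sketch: space-joined n-grams are ambiguous in vocabulary_.
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["first document", "second document"]

# With ngram_range=(1, 2), the bigram "first document" is stored in the
# vocabulary as one space-joined string.
bigrams = CountVectorizer(ngram_range=(1, 2)).fit(corpus)
print(sorted(bigrams.vocabulary_))

# A custom analyzer that keeps spaces can produce the *same* key for a
# single token, e.g. one that yields the whole document as one token.
unigrams = CountVectorizer(analyzer=lambda doc: [doc]).fit(corpus)
print(sorted(unigrams.vocabulary_))

# Both vocabularies contain the identical key "first document", so the
# converter cannot distinguish the two cases from the vocabulary alone.
assert "first document" in bigrams.vocabulary_
assert "first document" in unigrams.vocabulary_
```

The TraceableCountVectorizer / TraceableTfidfVectorizer wrappers used later in this thread resolve this by keeping n-grams as tuples of tokens rather than joined strings.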
Thank you for the quick reply! Using the test cases as a model, I wrote a sample program, but I was not having success when using any of the min_df, max_df, or max_features parameters. It looks like convert_sklearn_text_vectorizer expects stop_words_ to be a set of strings, but when min_df et al. are set, it becomes a set of tuples, so running to_onnx on the model fails with an error like this:
TypeError: One stop word is not a string ('is', 'this') in stop_words={('is', 'this'), ...
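For context (a sketch with stock scikit-learn, not the Traceable wrapper): when max_features, min_df, or max_df prunes the vocabulary, the pruned terms land in stop_words_ as space-joined strings, which is the shape convert_sklearn_text_vectorizer appears to expect. The error above arises because the Traceable* wrappers store them as tuples of tokens instead.

```python
# Sketch: in stock scikit-learn, pruned terms in stop_words_ are plain
# strings (space-joined n-grams), not tuples.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
]
vect = TfidfVectorizer(ngram_range=(1, 2), max_features=4).fit(corpus)

# max_features=4 prunes most of the unigrams and bigrams into stop_words_.
print(vect.stop_words_)
assert len(vect.stop_words_) > 0
assert all(isinstance(w, str) for w in vect.stop_words_)
```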
I've sort of fixed my problem by setting stop_words_ to an empty set and moving on. However, I'm still seeing some discrepancies and tracked them down to what looks like single-character tokens not being skipped by the ONNX model.
To reproduce:
import numpy as np
import skl2onnx
from skl2onnx.sklapi import TraceableTfidfVectorizer, TraceableCountVectorizer
from skl2onnx import update_registered_converter
from skl2onnx.shape_calculators.text_vectorizer import (
    calculate_sklearn_text_vectorizer_output_shapes,
)
from skl2onnx.operator_converters.text_vectoriser import convert_sklearn_text_vectorizer
from skl2onnx.operator_converters.tfidf_vectoriser import convert_sklearn_tfidf_vectoriser
from skl2onnx.common.data_types import StringTensorType
from numpy.testing import assert_almost_equal
from onnxruntime import InferenceSession

update_registered_converter(
    TraceableCountVectorizer,
    "Skl2onnxTraceableCountVectorizer",
    calculate_sklearn_text_vectorizer_output_shapes,
    convert_sklearn_text_vectorizer,
    options={
        "tokenexp": None,
        "separators": None,
        "nan": [True, False],
        "keep_empty_string": [True, False],
        "locale": None,
    },
)

update_registered_converter(
    TraceableTfidfVectorizer,
    "Skl2onnxTraceableTfidfVectorizer",
    calculate_sklearn_text_vectorizer_output_shapes,
    convert_sklearn_tfidf_vectoriser,
    options={
        "tokenexp": None,
        "separators": None,
        "nan": [True, False],
        "keep_empty_string": [True, False],
        "locale": None,
    },
)

corpus = np.array([
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
    "first Z document",  # with this line present, the last assertion fails because ONNX still sees the "Z" token when it shouldn't?
    "",
])

FEATURES = 12
vect = TraceableTfidfVectorizer(ngram_range=(1, 2), max_features=FEATURES)
vect.fit(corpus)
print("vocabulary:", vect.vocabulary_)
print("stop_words:", vect.stop_words_)

A = vect.transform(corpus).todense()
vect.stop_words_ = set()  # should be safe, right?
B = vect.transform(corpus).todense()
assert_almost_equal(A, B)  # make sure the vectorizer still does what we expect

model_onnx = skl2onnx.to_onnx(
    vect,
    "TfidfVectorizer",
    initial_types=[("input", StringTensorType([None, 1]))],
    target_opset=12,
)
sess = InferenceSession(model_onnx.SerializeToString(), providers=["CPUExecutionProvider"])
inputs = {"input": corpus.reshape((-1, 1))}
out = sess.run(None, inputs)[0]
assert out.shape == (len(corpus), FEATURES)
assert_almost_equal(out.ravel(), A.A1)
scikit-learn gives the same vector for both strings, as expected:
vect.transform(["first document", "first Z document"]).todense()
matrix([[0. , 0.51822427, 0.60474937, 0.60474937, 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. ],
[0. , 0.51822427, 0.60474937, 0.60474937, 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. ]])
And the model from onnx only matches the result on the first string:
sess.run(None, {'input': [["first document"], ["first Z document"]]})[0]
array([[0. , 0.5182243 , 0.6047494 , 0.6047494 , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. ],
[0. , 0.65069556, 0.7593387 , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. ]], dtype=float32)
Hi, I haven't understood the details here. ONNX is a specification; the issue could be either on the conversion side or a bug in the onnxruntime implementation. If it is the latter, I would recommend filing an issue in the onnxruntime repo. Do you know which it is?
Xavier might have a better idea (he is off this week).
Nope, I don't know. I don't have a full picture of all the related projects, their ownership, and the relations between them. Thanks for the link though; I thought I had found a related issue reported there, but I don't think that's it.
I was able to resolve my issue by explicitly passing "tokenexp": r"\b\w\w+\b" in the options. I don't think this substitution is correct:
https://github.com/onnx/sklearn-onnx/blob/d2029c1a9752f62a63fc5c4447b4d9fe75e8fe39/skl2onnx/operator_converters/text_vectoriser.py#L238
It should be something like [a-zA-Z0-9_]{2,} or \b\w\w+\b. Is the reason for this replacement that the ONNX runtime chokes on the (?u) in scikit-learn's default token_pattern?
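A quick stdlib check (not from the thread, just an illustration) that the two patterns suggested above tokenize like scikit-learn's default token_pattern r"(?u)\b\w\w+\b": all three keep word tokens of two or more characters and drop single characters such as "Z".

```python
# Compare scikit-learn's default token_pattern with the two suggested
# replacements on the string that exposed the discrepancy.
import re

text = "first Z document"
sklearn_default = re.findall(r"(?u)\b\w\w+\b", text)
alt1 = re.findall(r"\b\w\w+\b", text)            # (?u) is redundant on Python 3 str
alt2 = re.findall(r"[a-zA-Z0-9_]{2,}", text)     # ASCII-only approximation of \w\w+

print(sklearn_default)  # ['first', 'document']  -- "Z" is dropped
assert sklearn_default == alt1 == alt2 == ["first", "document"]
```

Note that `[a-zA-Z0-9_]{2,}` only matches ASCII word characters, so it is an approximation: it diverges from `\w\w+` on non-ASCII text.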