
Comments (5)

jklaise avatar jklaise commented on June 10, 2024

Hey, issues with mixed data usually arise from a mis-specification of the categorical_names parameter. I would double-check this first and make sure that all categorical columns are keyed and that the values of each key cover all categories.

Edit: I see the call to to_numpy(), which suggests that X is a pd.DataFrame. For categorical variables we use the convention that they need to be label-encoded (i.e. values per category should be [0, 1, ..., n-1], where n is the number of categories for that variable). I would also check whether that's the case.
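To make the convention concrete, here is a minimal sketch of label-encoding the categorical columns of a DataFrame before calling to_numpy(). The toy data and column names are made up for illustration, and building categorical_names as a dict from column index to a list of category names is an assumption based on the description above rather than a statement about the library's internals.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25, 40, 31, 58],
    "colour": ["red", "blue", "red", "green"],
})

categorical_columns = ["colour"]
categorical_names = {}  # assumed shape: column index -> list of category names

encoded = df.copy()
for col in categorical_columns:
    cat = pd.Categorical(df[col])   # categories are the sorted unique values
    encoded[col] = cat.codes        # integer codes 0, 1, ..., n-1
    categorical_names[df.columns.get_loc(col)] = list(cat.categories)

X = encoded.to_numpy()
```

Note that pd.Categorical assigns the code -1 to missing values, which falls outside the 0..n-1 range, so missing entries would need separate handling.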

from alibi.

pocman avatar pocman commented on June 10, 2024

For categorical variables we use the convention that they need to be label-encoded (i.e. values per category should be [0, 1, ..., n-1] where n is the number of categories for that variable). Would also check if that's the case.

This is it, that's why my example is failing.
It would be great to add some assert check to enforce that convention in anchor_tabular init.


jklaise avatar jklaise commented on June 10, 2024

For categorical variables we use the convention that they need to be label-encoded (i.e. values per category should be [0, 1, ..., n-1] where n is the number of categories for that variable). Would also check if that's the case.

This is it, that's why my example is failing. It would be great to add some assert check to enforce that convention in anchor_tabular init.

There are multiple ways to go about validation, and validating custom user data is usually fairly tricky. I would be keen to hear if you have more specific suggestions, e.g.:

  • validate categorical_names - this doesn't give us much as it wouldn't confirm whether the actual data is label-encoded or not
  • validate X_train during fit - here we could cross-reference with categorical_names and check that the categorical columns are as expected
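The second option could look roughly like the sketch below. The function check_label_encoded is hypothetical (it is not part of alibi), and it assumes that categorical_names maps a column index to the list of category names for that column and that X is non-empty.

```python
import numpy as np

def check_label_encoded(X, categorical_names):
    """Raise ValueError if a categorical column is not encoded as 0..n-1.

    Hypothetical helper: cross-references the data with categorical_names.
    """
    X = np.asarray(X)
    for col, names in categorical_names.items():
        n = len(names)
        values = np.unique(X[:, col])
        # Every observed value must be a whole number in the range [0, n-1].
        is_integral = bool(np.all(np.mod(values, 1) == 0))
        in_range = bool(values.min() >= 0 and values.max() <= n - 1)
        if not (is_integral and in_range):
            raise ValueError(
                f"Column {col} has values {values.tolist()}, "
                f"expected integers in 0..{n - 1}."
            )

# Usage: the first array follows the convention, the second uses -1 for missing data.
names = {1: ["blue", "green", "red"]}
check_label_encoded(np.array([[25, 2], [40, 0], [31, 1]]), names)  # passes silently
try:
    check_label_encoded(np.array([[25, 2], [40, -1]]), names)
except ValueError as err:
    message = str(err)
```

A check like this only confirms that the observed values are compatible with categorical_names; it cannot tell whether the integers map to the intended category names.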


pocman avatar pocman commented on June 10, 2024

I would suggest doing it in gen_category_map and maybe updating the description of the method.


On another note, AnchorTabular supporting only label-encoded values is a major blocker for my use case, since in my stack I represent missing data as -1. I believe minor changes in the explainer would allow it to support both encoded and raw label values.


jklaise avatar jklaise commented on June 10, 2024

I understand that some of our conventions make it more difficult to cater for all use-cases, but this is a trade-off we've had to make, at least for the time being. The alternative here would be having to consider any custom encoding scheme, e.g. even allowing label-encoding with arbitrary user-supplied integers for categories would be infeasible without every user providing even more metadata about their specific encoding.

As for your case, there are a couple of workarounds. Essentially missing data is another type of category (separate for each categorical data column). This gives two options:

  • If changing your encoding is an option, you could encode the missing values as the last category for each column. E.g. instead of -1 for every missing value across all categories, for a column i with categories encoded as 0, 1, ..., n_i-1, a missing value would be encoded as n_i (i.e. as an extra category for column i).
  • If changing the encoding for your model is not feasible, you could consider writing a wrapper prediction function similar to this. I.e. the wrapper function would expect the data as alibi expects it (label-encoded - you could use the same trick as above to encode missing data as an extra category), then a transform_input function would transform all those extra categories to -1 before feeding into the model.
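Both workarounds can be sketched together as follows. All names here (MISSING, encode_missing, transform_input, wrap_predictor, n_categories) are hypothetical and chosen for illustration; n_categories is assumed to map each categorical column index i to n_i, the number of real categories, so that n_i itself is free to act as the extra "missing" category. Under the same assumption, categorical_names would need one extra entry per column to name that category.

```python
import numpy as np

MISSING = -1  # the model's own marker for missing data

def encode_missing(X, n_categories):
    """Model format -> label-encoded format: replace -1 with n_i in column i."""
    X = np.array(X, copy=True)
    for col, n in n_categories.items():
        X[X[:, col] == MISSING, col] = n
    return X

def transform_input(X, n_categories):
    """Label-encoded format -> model format: replace n_i with -1 in column i."""
    X = np.array(X, copy=True)
    for col, n in n_categories.items():
        X[X[:, col] == n, col] = MISSING
    return X

def wrap_predictor(predict_fn, n_categories):
    """Return a predictor that accepts label-encoded data and feeds the model its own format."""
    def predictor(X):
        return predict_fn(transform_input(X, n_categories))
    return predictor

# Usage with a stand-in model that flags rows whose column 1 is missing.
n_categories = {1: 3}                      # column 1 has real categories 0, 1, 2
X_model = np.array([[0, -1], [1, 2]])      # data as the model expects it
X_encoded = encode_missing(X_model, n_categories)
predict_fn = lambda X: (X[:, 1] == MISSING).astype(int)
predictor = wrap_predictor(predict_fn, n_categories)
preds = predictor(X_encoded)
```

The wrapped predictor, not the original model function, is what would then be handed to the explainer, together with the re-encoded training data.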

I believe minor changes in the explainer would allow it to support both encoded and raw label values.

I'm not sure I follow. By "raw label values", do you mean string labels? It would be good to see what you have in mind.

