Comments (13)

DougVegas commented on May 13, 2024

Hi @spsanderson

Looking into the issue now. However, I don't think the data set you provided is the correct one. You provided customer_trends_tbl, but the AutoKMeans example uses a data set named customer_product_tbl.

I can't reproduce the error because there is no column called bikeshop_name.

from autoquant.

spsanderson commented on May 13, 2024

customer_product_tbl.xlsx

This is the correct file.

AdrianAntico commented on May 13, 2024

@spsanderson Why is the data in the form that it is? What do the values in each column represent?

spsanderson commented on May 13, 2024

The values are the proportions of purchases of each bike model from a bikeshop. Is the function expecting a different form?
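For readers following along, here is a minimal base-R sketch of how a table of per-shop purchase proportions could be derived from transactional rows. The data frame below is made up for illustration, and the `model` column name is an assumption; only `bikeshop_name` and `quantity` are named in the thread.

```r
# Hypothetical transactional data. `bikeshop_name` and `quantity` are column
# names mentioned in the thread; `model` and all values are assumptions.
orders <- data.frame(
  bikeshop_name = c("Shop A", "Shop A", "Shop A", "Shop B", "Shop B"),
  model         = c("Road", "Mountain", "Road", "Road", "Mountain"),
  quantity      = c(2, 1, 1, 1, 3)
)

# Sum quantity per shop (rows) and bike model (columns).
qty_wide <- xtabs(quantity ~ bikeshop_name + model, data = orders)

# Divide each row by its total so every shop's proportions sum to 1.
prop_wide <- prop.table(qty_wide, margin = 1)

# Convert the contingency table to a plain wide data frame.
prop_tbl <- as.data.frame.matrix(prop_wide)
print(prop_tbl)
```

Each row of `prop_tbl` is one shop, and each column is the share of that shop's total quantity going to one bike model.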

AdrianAntico commented on May 13, 2024

I wouldn't aggregate the data before running k-means. I would use the transactional data instead.

spsanderson commented on May 13, 2024

I will try it and report back

AdrianAntico commented on May 13, 2024

Sounds good

spsanderson commented on May 13, 2024

My data looks like the attached file. Should I reduce my data to strictly the quantity column (this is what I am aggregating)?

bike_orderlines_tbl.xlsx

spsanderson commented on May 13, 2024

With the following code and the attached data I get only 2 clusters (0 and 1), when there should really be at least 4, which is what I get from the method in the original post.

library(dplyr)  # provides %>% and select()

AutoK_obj <- RemixAutoML::AutoKMeans(
  # Drop the prop_of_total column before clustering
  data           = customer_trends_tbl %>% select(-prop_of_total),
  KMeansK        = 15,
  KMeansMetric   = "tot_withinss",
  GridTuneGLRM   = TRUE,
  GridTuneKMeans = TRUE
)

customer_trends_tbl.xlsx

AdrianAntico commented on May 13, 2024

@spsanderson I would start tinkering with the arguments. Internally, the function first builds a GLRM model from H2O for dimensionality reduction. You select how many factors from that model to keep, and those factors are passed to the K-means algorithm from H2O, which searches for the optimal k using the GLRM factor data.
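The two-stage idea described above (reduce dimensions, then cluster the reduced factors over a range of k) can be sketched in base R. This is only an illustration and not the AutoKMeans internals: PCA via `prcomp` stands in for the H2O GLRM, `stats::kmeans` stands in for H2O's K-means, and the built-in `USArrests` data, `n_factors`, and `ks` are assumptions chosen for the example.

```r
# Base-R stand-in for the two-stage pipeline; NOT the AutoKMeans internals.
set.seed(42)

# Standardize the numeric features (built-in data set, chosen for illustration).
X <- scale(USArrests)

# Stage 1: dimensionality reduction. PCA is used here in place of an H2O GLRM.
n_factors <- 2  # hypothetical; loosely analogous to glrmFactors
pca <- prcomp(X)
factors <- pca$x[, seq_len(n_factors), drop = FALSE]

# Stage 2: fit k-means on the reduced factors for each candidate k.
ks <- 2:8       # hypothetical range; loosely analogous to searching up to KMeansK
fits <- lapply(ks, function(k) kmeans(factors, centers = k, nstart = 25))

# Collect the total within-cluster sum of squares for each k.
tot_withinss <- vapply(fits, function(f) f$tot.withinss, numeric(1))
results <- data.frame(k = ks, tot_withinss = tot_withinss)
print(results)
```

Both the number of factors kept in stage 1 and the range of k searched in stage 2 affect how many clusters come out, which is one reason to experiment with the corresponding arguments.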

If you go through the help file (?RemixAutoML::AutoKMeans), you can read up on what each argument does. The function is intended to be flexible for most kinds of data sets but you will want to try several settings if you don't already have a good idea of how to set it for your particular case.

This function is just a beginning for unsupervised learning. I spend most of my time working on the supervised learning stuff since I encounter it more often in practice, but I will get around to enhancing these at some point. If you are interested in helping out let me know.

AutoKMeans <- function(data,
                       nthreads        = 8,
                       MaxMem          = "28G",
                       SaveModels      = NULL,
                       PathFile        = NULL,
                       GridTuneGLRM    = TRUE,
                       GridTuneKMeans  = TRUE,
                       glrmCols        = c(1:5),
                       IgnoreConstCols = TRUE,
                       glrmFactors     = 5,
                       Loss            = "Absolute",
                       glrmMaxIters    = 1000,
                       SVDMethod       = "Randomized",
                       MaxRunTimeSecs  = 3600,
                       KMeansK         = 50,
                       KMeansMetric    = "totss") {

spsanderson commented on May 13, 2024

Working through it. It seems that even on the Iris dataset, h2o::h2o.kmeans produces only 2 clusters when we know there are 3. I forked and cloned the repo and will work on it.
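As a point of comparison for the Iris check, here is a small base-R sketch that uses `stats::kmeans` rather than H2O, so it does not reproduce the behavior reported above. It fits k = 3 directly and then tabulates the total within-cluster sum of squares for a few values of k; the seed, `nstart`, and the range of k are arbitrary choices for illustration.

```r
# Baseline with stats::kmeans (not H2O) on the built-in iris data.
set.seed(1)

# Standardize the four numeric measurements.
X <- scale(iris[, 1:4])

# Fit k-means with k fixed at 3, the number of species in the data.
fit3 <- kmeans(X, centers = 3, nstart = 25)

# Cross-tabulate cluster assignments against the known species labels.
confusion <- table(cluster = fit3$cluster, species = iris$Species)
print(confusion)

# Total within-cluster sum of squares for a range of k.
ks <- 2:6
wss <- vapply(
  ks,
  function(k) kmeans(X, centers = k, nstart = 25)$tot.withinss,
  numeric(1)
)
print(data.frame(k = ks, tot_withinss = wss))
```

Because the total within-cluster sum of squares generally keeps shrinking as k grows, it is usually read with an elbow-style judgment rather than by taking its minimum, which may matter when comparing against whatever selection rule an automated search applies.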
