mltk's Introduction

Machine Learning Tool Kit

MLTK is a collection of supervised machine learning algorithms, designed both for training models directly and for further development. For questions or suggestions about the code, please email [email protected].

See the wiki for full documentation, examples, and other information.

mltk's People

Contributors

dependabot[bot], michaellavelle, yinlou

mltk's Issues

Typo in DoublePairComparator

My guess is that it does not create a bug, but there is a typo in the function compare in DoublePairComparator in mltk.predictor.evaluation.AUC.
You wrote:

int cmp = Double.compare(o1.v1, o2.v1);
if (cmp == 0) {
    cmp = Double.compare(o2.v2, o2.v2);
}

but the second comparison should be Double.compare(o1.v2, o2.v2).
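
A minimal sketch of the corrected comparator follows. The `DoublePair` class here is a hypothetical stand-in with public `v1` and `v2` fields, assumed only so the example compiles on its own; it is not the actual MLTK type.

```java
import java.util.Comparator;

/** Hypothetical stand-in for the pair type used by the AUC code. */
final class DoublePair {
    final double v1;
    final double v2;

    DoublePair(double v1, double v2) {
        this.v1 = v1;
        this.v2 = v2;
    }
}

/** Orders pairs by v1 and breaks ties on v2. */
final class DoublePairComparator implements Comparator<DoublePair> {
    @Override
    public int compare(DoublePair o1, DoublePair o2) {
        int cmp = Double.compare(o1.v1, o2.v1);
        if (cmp == 0) {
            // Compare o1 against o2; comparing o2.v2 with itself always yields 0.
            cmp = Double.compare(o1.v2, o2.v2);
        }
        return cmp;
    }
}
```

With the original line, two pairs that tie on `v1` always compare as equal, so their relative order is never decided by `v2`.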

GAM plots with nominal interaction terms

It seems to me that there is a bug in Visualizer when one of the interaction terms is not of type BinnedAttribute: at line 251 there is a cast, Bins bins1 = ((BinnedAttribute) f1).getBins();, with no preceding type test. This is odd because earlier checks handle NominalAttribute, but the rest of the code seems to work only for BinnedAttribute (in particular, it needs boundaries, which are only defined for bins).
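
One way to guard such a cast is an `instanceof` test before it. The sketch below uses small stand-in types that only mirror the names mentioned above (`Attribute`, `BinnedAttribute`, `NominalAttribute`, `Bins`); they are assumptions made for illustration, not the real MLTK classes, and `BinsLookup.binsOf` is a hypothetical helper.

```java
import java.util.Optional;

/** Stand-in types, assumed for illustration only. */
interface Attribute {}

final class Bins {
    final double[] boundaries;

    Bins(double[] boundaries) {
        this.boundaries = boundaries;
    }
}

final class BinnedAttribute implements Attribute {
    private final Bins bins;

    BinnedAttribute(Bins bins) {
        this.bins = bins;
    }

    Bins getBins() {
        return bins;
    }
}

final class NominalAttribute implements Attribute {}

final class BinsLookup {
    /** Returns the bins only when the attribute actually has them. */
    static Optional<Bins> binsOf(Attribute f) {
        if (f instanceof BinnedAttribute) {
            // Safe: the type was checked on the line above.
            return Optional.of(((BinnedAttribute) f).getBins());
        }
        // Nominal (or other) attributes have no boundaries to plot.
        return Optional.empty();
    }
}
```

A caller could then take a separate plotting path for nominal attributes when the result is empty, instead of failing with a ClassCastException.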

"mvn clean package" doesn't work

I'm trying to build on OS X 10.14.4.

  • Downloaded Oracle's java 12 with brew install cask java
  • Downloaded maven with brew install maven
  • Ran mvn clean package as directed here.

I get this error:

1 error
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  9.082 s
[INFO] Finished at: 2019-04-23T21:20:38-05:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-javadoc-plugin:2.10.3:jar (attach-javadocs) on project mltk: MavenReportException: Error while generating Javadoc: 
[ERROR] Exit code: 1 - javadoc: error - The code being documented uses modules but the packages defined in http://docs.oracle.com/javase/8/docs/api/ are in the unnamed module.
[ERROR] 
[ERROR] Command line was: /Library/Java/JavaVirtualMachines/openjdk-12.0.1.jdk/Contents/Home/bin/javadoc @options @packages
[ERROR] 
[ERROR] Refer to the generated Javadoc files in '/Users/dan/work/mltk/target/apidocs' dir.
[ERROR] 
[ERROR] -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

Might just be docs?

ElasticNet results inconsistent

If I run an elastic net learner a dozen times in a row, I get nine different results. I've confirmed that my input data is identical on each round (the hashcodes on my data strings and attribute strings are identical each round). Is there anything variable or non-deterministic about the elastic net process? Also, all of my work is done using BigDecimal, so any floating point issues would be within the regression. Is that what's happening here?
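
As an illustration of how a double-based computation can vary from run to run, the sketch below sums the same values in two different orders. Whether MLTK's elastic net actually changes its order of operations between runs (for example through shuffling or parallel reduction) is not established here; `SummationOrder` is a hypothetical class showing only the general effect.

```java
/** Sums the same doubles in two orders to show that the result can differ. */
final class SummationOrder {
    static double sumLeftToRight(double[] xs) {
        double s = 0.0;
        for (int i = 0; i < xs.length; i++) {
            s += xs[i];
        }
        return s;
    }

    static double sumRightToLeft(double[] xs) {
        double s = 0.0;
        for (int i = xs.length - 1; i >= 0; i--) {
            s += xs[i];
        }
        return s;
    }

    public static void main(String[] args) {
        double[] xs = {0.1, 0.2, 0.3};
        // Floating-point addition is not associative, so these may not match.
        System.out.println(sumLeftToRight(xs));
        System.out.println(sumRightToLeft(xs));
    }
}
```

For the inputs 0.1, 0.2, 0.3 the two sums differ in the last bit, so identical input data does not by itself guarantee identical output if the evaluation order changes.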

Need for programmatic setup of datasets

Is there any way to build datasets in memory, using the API, or do I need to write out a file to disk, and read it back in?

I tried creating a dataset using the API, but the methods and constructors of Attribute are not visible, so I can't create a List<Attribute>, so I can't create an Instances object, so I can't create cross-validation folds.

Regression trees cause GC churn

The regression tree methods in MLTK allocate and drop a very large number of objects, which causes GC churn and puts heavy strain on the VM. Much of the run time is spent in garbage collection, and the impact is even worse if you are trying to run several regressions in parallel, since the JVM doesn't do a good job of concurrent garbage collection. (The scalability issue alone will probably mean I can't use MLTK for my task.)

An object instance recycling scheme would help immensely with this problem.
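
A minimal sketch of what such a recycling scheme could look like, assuming the churn comes from short-lived fixed-length double[] buffers. `DoubleArrayPool` is a hypothetical class, not part of MLTK, and it is not thread-safe; parallel regressions would each need their own pool.

```java
import java.util.ArrayDeque;
import java.util.Arrays;

/** Hypothetical pool that reuses fixed-length double[] buffers. */
final class DoubleArrayPool {
    private final ArrayDeque<double[]> free = new ArrayDeque<>();
    private final int length;
    private final int maxRetained;

    DoubleArrayPool(int length, int maxRetained) {
        this.length = length;
        this.maxRetained = maxRetained;
    }

    /** Returns a zeroed buffer, reusing a released one when available. */
    double[] acquire() {
        double[] buf = free.pollFirst();
        return buf != null ? buf : new double[length];
    }

    /** Hands a buffer back; it is zeroed so the next user sees clean state. */
    void release(double[] buf) {
        if (buf.length != length) {
            throw new IllegalArgumentException("buffer length mismatch");
        }
        // Cap retention so the pool itself cannot grow without bound.
        if (free.size() < maxRetained) {
            Arrays.fill(buf, 0.0);
            free.addFirst(buf);
        }
    }
}
```

The caller must not touch a buffer after releasing it, which is the usual cost of pooling compared with letting the garbage collector manage lifetimes.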

Documentation is missing many important details

I have been trying to use MLTK for a regression problem, and I found there are many important details missing from the documentation:

  • There is no info on how to set up datasets in memory programmatically, rather than reading from file (see #14).
  • On the page https://github.com/yinlou/mltk/wiki/Basics , there is no explanation of "(class)". I eventually figured out that it's how you set the target variable. That doesn't make sense for regression though, the target variable is not a class. It's also made more confusing because the section "Example attribute file" has the "(class)" attribute last, but under "Sparse input format", you list the target first, implying that the position is significant (which it is not).
  • There is no explanation that if the number of attributes doesn't match the number of columns, lines are simply ignored. This is more confusing because, comparing the sections "Example attribute file", "Example data file" and "Sparse Input Format", it's not clear whether the target attribute even needs to be set in the attribute file, or whether its discrete/continuous nature is inferred from whether you're doing classification or regression (and whether the first column is always assumed to be the target). If you have the wrong number of attributes, no instances are read.
  • If you don't specify the target attribute, it is set to NaN, and the regression will fail unless you go through all the Instances and set a target value on each one. I couldn't figure out why I was getting NaN until I read through the code.
  • You do say that the attribute file parameter to InstancesReader.read is "optional", but even the JavaDoc doesn't explain that you need to set this to null for this purpose. The way that attributes are automatically inferred for sparse and dense cases if you do set this to null is not explained. (For example, the target is always unset if you don't specify an attributes file, so you'll always get NaN for the target values unless you manually set the target, as described above.)
  • There is no example explaining the usage with a dense datafile.

That's as far as I have gotten so far... I finally got my first regression results, but it took me several hours to figure out how to use this properly... hopefully this feedback helps!

Residuals not saved when building GAM

On the Intelligible Models webpage you say that we should pass the option -R cal_housing.residual when building the GAM model (step 3), so that it can be used to detect the interactions in step 4.

However, this version of the code does not handle this option and does not save the residuals. Can you confirm that the residuals you mention are those stored in rTrain in your code?
I've fixed it and can submit a pull request if you want.

By the way, the option -T is not handled either, so the score on the test set is not computed.

Thanks a lot for sharing your work!

Support for data already in program

Does mltk have support for data structures which are already present in the runtime of the program? I'm loading and manipulating data after fetching it from a remote database, and would love to pump in my matrices directly into mltk without writing to disk.

JAR file

I am new to Java. Could you provide a .jar file, please?

Is a normalization step needed in feature preparation?

If the original features have ranges other than [-1, 1], do we need to normalize/calibrate their values before running mltk.predictor.gam.GAMLearner? Or will GAMLearner take care of it?

Another question: for mltk.predictor.evaluation.Evaluator, where can we find its metric output? I don't see the metric numbers displayed in the output or saved to any output file.
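
For reference, if one did choose to rescale a feature to [-1, 1] before training, a min-max transform could be sketched as below. This does not answer whether GAMLearner needs it; `MinMaxScaler` is a hypothetical helper, not an MLTK class.

```java
/** Hypothetical helper that linearly rescales one feature column. */
final class MinMaxScaler {
    /** Maps the column minimum to -1 and the maximum to 1; constant columns map to 0. */
    static double[] scaleToUnitRange(double[] values) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double[] out = new double[values.length];
        double range = max - min;
        if (range == 0.0) {
            // Every value is identical, so leave the column at 0.
            return out;
        }
        for (int i = 0; i < values.length; i++) {
            out[i] = 2.0 * (values[i] - min) / range - 1.0;
        }
        return out;
    }
}
```

The minimum and maximum should be computed on the training data and reused for the test data, so that both sets pass through the same transform.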

How to run GA2M with FAST?

I've been reading the docs trying to create an example of classification with GA2MLearner using command-line tools.

I checked in https://github.com/dfrankow/mltk/tree/master/examples with train_ga2m.sh. You should be able to check it out and run it. If I can get it to work, I'm happy to pass it back as an example, as requested in #17.

Several questions:

  • How do we generate a sensible pairwise terms file to pass to GA2MLearner instead of including all? I think that would possibly include the FAST algorithm, but I don't know how to use it.
  • Why does Evaluator not have any output?
  • Can I use the command line to run predictions (in this case classification output) on the test set?

Could you provide a working data set?

This is a very interesting package for ML in Java.
I assume some working data sets were used during development.
Could you provide some of them as demos to help users get started?
