haifengl / smile

Statistical Machine Intelligence & Learning Engine

Home Page: https://haifengl.github.io

License: Other

Topics: machine-learning, regression, clustering, manifold-learning, nlp, visualization, classification, nearest-neighbor-search, interpolation, wavelet

smile's People

Contributors

albertmolinermrf, alexey-grigorev-sm, beckgael, benmccann, diegocatalano, digital-thinking, doghere, elkfrawy-df, gforman44, ghostdogpr, gitter-badger, haifengl, inejc, jpe42, kevincooper, kid1412z, kno10, myui, pdkovacs, peter-toth, pierrenodet, rafaelsakurai, rayeaster, serickso, takanori-ugai, tdunning, tomassvensson, twotabbies, xyclade, yaqiang


smile's Issues

load the csv

Hello, I want to know how to load CSV files.
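
A minimal sketch of the loading pattern, pieced together from the DelimitedTextParser usage shown in a later issue on this page; the file name and the response column index are placeholders:

import smile.data.AttributeDataset;
import smile.data.NumericAttribute;
import smile.data.parser.DelimitedTextParser;

// Configure the parser: header row present, comma-separated fields.
DelimitedTextParser parser = new DelimitedTextParser();
parser.setColumnNames(true);
parser.setDelimiter(",");
// Mark column 13 as the response variable (placeholder for your data).
parser.setResponseIndex(new NumericAttribute("MEDV"), 13);

AttributeDataset dataset = parser.parse("housing.csv");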

Basic Tutorial Needed

The project seems super cool and well documented, but I think it needs a basic tutorial to get started and get to know the main classes to use. Something as simple as the following (a sketch of these steps follows the list):

  1. import the data
  2. split into train and test sets
  3. train a classifier on the training set
  4. test the classifier on the test set
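
A hedged sketch of those four steps, stitched together from APIs that appear in other issues on this page (DelimitedTextParser, AttributeDataset, DecisionTree); the file name, response column, and split ratio are placeholders:

import java.util.Arrays;
import smile.classification.DecisionTree;
import smile.data.AttributeDataset;
import smile.data.NominalAttribute;
import smile.data.parser.DelimitedTextParser;

// 1. Import the data.
DelimitedTextParser parser = new DelimitedTextParser();
parser.setDelimiter(",");
parser.setResponseIndex(new NominalAttribute("class"), 6);
AttributeDataset dataset = parser.parse("car.csv");
double[][] x = dataset.toArray(new double[dataset.size()][]);
int[] y = dataset.toArray(new int[dataset.size()]);

// 2. Split into train and test sets (simple 80/20 head/tail split).
int split = (int) (0.8 * x.length);
double[][] trainX = Arrays.copyOfRange(x, 0, split);
int[] trainY = Arrays.copyOfRange(y, 0, split);
double[][] testX = Arrays.copyOfRange(x, split, x.length);
int[] testY = Arrays.copyOfRange(y, split, x.length);

// 3. Train a classifier on the training set.
DecisionTree tree = new DecisionTree(trainX, trainY, 30);

// 4. Test the classifier on the test set.
int correct = 0;
for (int i = 0; i < testX.length; i++) {
    if (tree.predict(testX[i]) == testY[i]) correct++;
}
System.out.println("Accuracy: " + (double) correct / testX.length);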

Make all classes serializable

When using random forest, for example, every time the model is trained it produces slightly different predictions.

It would be important to train the model once, save it to a file using serialization, and then use this saved version to always get the same results.

For that to work, all classes must be serializable.
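
A hedged sketch of what saving and loading would look like once the models implement java.io.Serializable (which, per this issue, they currently do not); the helper methods and file path are illustrative:

import java.io.*;

// Write any serializable model to disk.
static void save(Object model, String path) throws IOException {
    try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(path))) {
        out.writeObject(model);
    }
}

// Read it back later for reproducible predictions.
static Object load(String path) throws IOException, ClassNotFoundException {
    try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(path))) {
        return in.readObject();
    }
}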

Publish smile-scala to Maven

Noticed that smile-core is in Maven central but the Scala API has been left out :(

Can you help turn :( into :) ?

How to use SVM with only 2 classes

I'm trying to use the SVM algorithm to train a classifier for a dataset with only 2 classes. Java throws an IllegalArgumentException with the message "Invalid number of classes: 2".

How can I train a classifier using the algorithm under these constraints?
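
For context, a hedged guess at the cause: the multiclass SVM constructor (which takes a class count) may be what rejects k = 2, and a separate binary constructor taking only a kernel and a soft-margin penalty may be the intended entry point. The constructor shapes, kernel choice, and parameter values below are assumptions, not verified against the javadocs:

import smile.classification.SVM;
import smile.math.kernel.GaussianKernel;

// Toy 2-class data (placeholders).
double[][] x = {{0, 0}, {0, 1}, {5, 5}, {5, 6}};
int[] y = {0, 0, 1, 1};

// Assumed binary constructor: kernel + soft-margin penalty C, no class count.
SVM<double[]> svm = new SVM<>(new GaussianKernel(1.0), 1.0);
svm.learn(x, y);
svm.finish();  // assumption: finalizes training in this version
int p = svm.predict(new double[]{0, 0.5});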

NLP initialisation breaks on new SimpleCorpus

When I start a new project and include smile (including NLP) from Maven, simply running the following code gives an initialization exception:

new SimpleCorpus();

The library seems to be missing the dictionary files.

Stacktrace:

Caused by: java.lang.NullPointerException
at java.io.Reader.<init>(Reader.java:78)
at java.io.InputStreamReader.<init>(InputStreamReader.java:72)
at smile.nlp.dictionary.EnglishStopWords.<init>(EnglishStopWords.java:59)
at smile.nlp.dictionary.EnglishStopWords.<clinit>(EnglishStopWords.java:34)
... 17 more

SmileDemo

Can you please provide sources for SmileDemo?
Thanks!
-a

Failing Test

When I run mvn test I get this output:

Failed tests:   testMultivariateGaussianDistribution(smile.stat.distribution.MultivariateGaussianDistributionTest): expected:<0.9> but was:<1.0042573275810218>
  testLogNormalDistribution(smile.stat.distribution.LogNormalDistributionTest): expected:<3.0> but was:<3.120146672473464>

Which is the best way to represent a binary attribute in supervised learning algorithms? NominalAttribute or NumericAttribute?

Hi!

I am working on a tutorial about how to use Smile in NLP applications, and I have a few questions for which I did not find an easy answer in the javadocs.

Is there a place where I could ask you, @haifengl?
Or is it okay to use this issue to talk about them?

By the way, the first one: which is the best way to represent a binary attribute in supervised learning algorithms, NominalAttribute or NumericAttribute? (A small illustration of the two options follows this post.)

Thanks a lot!
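
A hedged illustration of the two representations being compared; the attribute name is a placeholder, and this only shows how the feature would be declared, not which option trains better:

import smile.data.Attribute;
import smile.data.NominalAttribute;
import smile.data.NumericAttribute;

// Option 1: a categorical attribute whose two categories encode the flag.
Attribute asNominal = new NominalAttribute("has_token");
// Option 2: a numeric attribute holding 0.0 or 1.0 directly.
Attribute asNumeric = new NumericAttribute("has_token");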

Axis Scale

Is it possible to change the axis labeling to a custom range? I've created a heatmap from spectral data:

[heatmap image]

Currently the y-axis labels the pixels (bin numbers in my case), but it would be nice if I could modify the plot to label it with a range of values that I provide (i.e., a frequency range).

I've done a good amount of digging and messing around with the smile.plot.Base class, but I get the sense I'm barking up the wrong tree.

CrossValidation with fixed partitions

Hi,

I have already checked a bunch of classes, but had no luck finding an example.

So, is there ready-made code to run CrossValidation so that it returns fixed partitions?
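
A hedged sketch based on the CrossValidation usage shown in a later issue on this page: the object exposes train and test index arrays, so computing them once and reusing those arrays everywhere gives fixed partitions for that run. Whether the partitions are reproducible across JVM runs is an assumption I have not verified:

import smile.validation.CrossValidation;

int n = 100;  // number of samples (placeholder)
CrossValidation cv = new CrossValidation(n, 10);
for (int i = 0; i < 10; i++) {
    // cv.train[i] / cv.test[i] hold the row indices of fold i; reuse these
    // exact arrays wherever the same split is needed.
    int[] trainIdx = cv.train[i];
    int[] testIdx = cv.test[i];
}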

Boxplot: unable to create correct plot when series are different lengths

Hi, and thank you for this awesome library.

I'm unable to create a correct multi-box boxplot when the number of observations differs between the boxes.

For example, I want to create a plot on a dataset where the variable of interest is a person's age, and there is one "box" displayed for males and a second for females. If there are more males, the box for females depicts observations at 0.0 for each observation that has no value, which throws off the distribution.

In the attached plot image, the dataset has no actual 0.0 values, but some are depicted.

I tried initializing my dataset to Double.NaN before filling it, but this caused problems with rendering the graph.

Thank you again for SMILE.

[boxplot image]

Publish to Maven repository

I'd love it if you could make these libraries available via Maven; it would make it much easier to include them in my own application. It should be pretty easy to add them to the central Maven repo, since you already have a Maven build file.

Lots of printlines from Math.java when running Lasso

In Math.java, on lines 4596 to 4601, there are two print statements. When running a Lasso regression this causes a lot of output on the console, making it hard to read the output from my own code. Could these be turned off, or made optional if they are really important? 😄

Caught and logged exception with randomforest

Things seem to work, but I'm seeing an exception in the console with a small test app:

java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: -1

Runnable source here:
https://github.com/mschulkind/smile-test

This is the bulk of it:

DelimitedTextParser parser = new DelimitedTextParser();
parser.setColumnNames(true);
parser.setDelimiter(",");
parser.setResponseIndex(new NumericAttribute("MEDV"), 13);

AttributeDataset dataset = parser.parse("housing.csv");

double[][] x = dataset.toArray(new double[dataset.size()][]);
double[] y = dataset.toArray(new double[dataset.size()]);

RandomForest forest = 
    new RandomForest(dataset.attributes(), x, y, 200); // Exception logged here.

add NER model

All the ingredients are there: train a model on all public NER data, including CoNLL, MUC-6, MUC-7, and ACE.

Crossvalidation Question

It might be my understanding of cross-validation, but when I do the following:

(Scala code)
val cv = new CrossValidation(100,1);
val trainingIndexes : Array[Array[Int]] = cv.train
val testingIndexes : Array[Array[Int]] = cv.test

The result is that trainingIndexes is an array containing one empty array, and testingIndexes is an array with one element holding 100 integers. My expectation was that both lists would contain 50 indexes.

The reason I was using it like this is that I want to do cross-validation with 50% of the data (which consists of 100 data points) as training and 50% as testing. Am I doing it wrong, or is there an error in the CrossValidation implementation (since it accepts these parameters but returns no training indexes)?

Build Failure

Trying to build using mvn package -Dmaven.test.skip=true:

I get the following error:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project smile-data: Compilation failure: Compilation failure:
[ERROR] /home/brandy/Software/smile/SmileData/src/main/java/smile/data/parser/BinarySparseDatasetParser.java:[116,13] try-with-resources is not supported in -source 1.5
[ERROR] (use -source 7 or higher to enable try-with-resources)
[ERROR] /home/brandy/Software/smile/SmileData/src/main/java/smile/data/parser/DelimitedTextParser.java:[223,13] try-with-resources is not supported in -source 1.5
[ERROR] (use -source 7 or higher to enable try-with-resources)
[ERROR] /home/brandy/Software/smile/SmileData/src/main/java/smile/data/parser/SparseMatrixParser.java:[95,13] try-with-resources is not supported in -source 1.5
[ERROR] (use -source 7 or higher to enable try-with-resources)
[ERROR] /home/brandy/Software/smile/SmileData/src/main/java/smile/data/parser/ArffParser.java:[395,13] try-with-resources is not supported in -source 1.5
[ERROR] (use -source 7 or higher to enable try-with-resources)

I'm not a Maven user, so maybe I have an error in my setup, but I was able to correct this by making a few changes:

  1. In pom.xml, changing the source and target maven-compiler-plugin values to 1.7 (a sketch of this change follows the list)
  2. Adding the following to SmileData/pom.xml and SmileNLP/pom.xml
    <parent>
        <groupId>com.github.haifengl</groupId>
        <artifactId>smile-all</artifactId>
        <version>1.0.3</version>
    </parent>
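
A hedged sketch of the step-1 change; the exact placement of the plugin block within smile's pom.xml may differ:

    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
            <source>1.7</source>
            <target>1.7</target>
        </configuration>
    </plugin>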

From my limited understanding, it seems the maven-compiler-plugin configuration isn't picked up in the submodules unless the submodule references the parent module (and the source/target setting is currently set to 1.6).

If this is a proper fix, I can put together a PR if you like.

RBFInterpolation NaN

Thanks for providing a Kriging and RBFInterpolation function!
I'm trying to use:
RBFInterpolation(coords, vals, new GaussianRadialBasis())
coords is a 2D double array of points in an x,y plane with range (0, 3000); vals is a 1D double array of values in the range (20, 100). After initializing the RBFInterpolation, I try to call interpolate(x, y), where x and y are in the range (0, 3000), e.g. interpolate(153, 331), and the result is NaN.

I noticed that in your ScatterDemo code you swap the coordinates and multiply them by a scaling factor:
rbf.interpolate(j_.12, i_12)

Do I need to swap my x and y coordinates when calling interpolate? What is the significance of the scaling factor?

NaiveBayes Priori Issue

In the constructor 'public Trainer(Model model, double[] priori, int p)' on line 211 of NaiveBayes, the sum is not properly accumulated from the individual probabilities, causing the exception on line 228 to always be thrown.

Models are not serializable

Is there a way to save the trained models? It seems that none of the models support serialization at all.

RandomForest ArrayIndexOutOfBoundsException

When calling the constructor of smile.regression.RandomForest with numeric and nominal attribute types mixed, I get the following exception (randomly):

ArrayIndexOutOfBoundsException 3 smile.regression.RegressionTree$TrainNode.findBestSplit (RegressionTree.java:484)

I do not get the exception with only numeric attributes.

ScatterPlot Class Too few legends / colors exception

In the ScatterPlot implementation, when you add labels, colors, or legends, the implementation uses the maximum value of the labels rather than the actual number of labels.

I wonder if this implementation is intended this way, since it implies that the labels array must always contain values starting at 0 and incrementing by 1 up to n (where n is the number of labels one wants to use).

I'd suggest building a map from the unique values obtained on line 129:
int[] id = Math.unique(y);

to the colors and legends, and using that map to look up the correct color, rather than using the label value directly as a palette index, as is done now on line 171:
g.setColor(palette[y[i]]);
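
A hedged sketch of that suggestion as a standalone helper; Math.unique would give the same distinct values, but the helper below just builds the dense index directly:

import java.util.HashMap;
import java.util.Map;

// Map each distinct label value to a dense 0..k-1 palette index.
static Map<Integer, Integer> denseIndex(int[] y) {
    Map<Integer, Integer> index = new HashMap<>();
    for (int label : y) {
        if (!index.containsKey(label)) {
            index.put(label, index.size());
        }
    }
    return index;
}

// With Map<Integer, Integer> index = denseIndex(y), painting point i becomes:
// g.setColor(palette[index.get(y[i])]);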

I stumbled upon this with the following code, which gives the 'Too few legends' exception:

val exampleData = Array(Array(1.0,120.0),Array(2,123.0))
val exampleLabels = Array(1,2)
val exampleLegends = Array('@','*')

val plot =  ScatterPlot.plot(exampleData,exampleLabels,exampleLegends,Array( Color.blue, Color.green))

To elaborate, the following code:

val exampleData = Array(Array(100.0,120.0),Array(200,123.0))
val exampleLabels = Array(100,200)
val exampleLegends = Array('@','*')

val plot =  ScatterPlot.plot(exampleData,exampleLabels,exampleLegends,Array( Color.blue, Color.green))

would require 201 colors and legend symbols (instead of 2, for the 2 labels).

RSS Fingerprint indoor localization using neural and SVR algorithm

Dear Sir,

I use your algorithms published in Smile (SVR regression and an RBF neural network) in my thesis to enhance indoor localization over WLAN, and many thanks for your effort.

I built an RSS fingerprint and implemented the affinity propagation algorithm to cluster the data; I trained an RBF neural network in the offline phase, and in the online phase I used the ANN estimate to test new data.
The paper I use as a reference is at this link:
https://www.dropbox.com/home?preview=06554922.pdf

my question

In the paper, the authors used Affinity Propagation and an RBF network as two phases of localization (coarse and fine), as shown in the figure below (from the paper).
[figure from the paper]
My question is: how can I get the two techniques to work together in the online phase? How can I use the output of phase 1 (cluster matching) as input to phase 2 (ANN estimation)? In other words, how can I tell the RBF network that its search must be restricted to cluster X (returned by cluster matching)?

What is your advice on how I can do that? Thanks a lot.

Is there an estimate of parallel/thread-safe Neural Net classifier?

Checking the source code of the neural nets, I found this statement:

Note that this method is NOT multi-thread safe.

So, is there an estimate for a parallel/thread-safe neural net classifier?

By the way, congrats on this simple and elegant project/engine.
Really appreciated!

NLP - nullpointer exception when searching on simplecorpus

When I execute the following code, the search method throws a NullPointerException (note: the code is Scala):

val nlp = new SimpleCorpus()
nlp.add("test", "test2", "The cake is a lie and tastes bad")
var possiblyEmptyIterator = nlp.search("good")

I had expected it to return an empty iterator, but instead it throws the following exception:

Exception in thread "main" java.lang.NullPointerException
at java.util.ArrayList.<init>(ArrayList.java:164)
at smile.nlp.SimpleCorpus.search(SimpleCorpus.java:237)
at Main$.delayedEndpoint$Main$1(package.scala:31)
at Main$delayedInit$body.apply(package.scala:9)
at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at Main$.main(package.scala:9)
at Main.main(package.scala)

After looking into SimpleCorpus I found why this happens: the search term is fetched from the HashMap without any check, so if it is not there, the HashMap get call returns null, which is then fed into the constructor of the ArrayList.

The same construction is used in all SimpleCorpus search methods.

My suggested fix is to check whether the HashMap contains the key before getting it. This way you do not need to store the (possibly null) list in a local variable to check for null.

public Iterator<Text> search(String term) {
    ArrayList<Text> hits = new ArrayList<Text>(invertedFile.get(term));
    return hits.iterator();
}

then becomes

public Iterator<Text> search(String term) {
    if (invertedFile.containsKey(term)) {
        ArrayList<Text> hits = new ArrayList<Text>(invertedFile.get(term));
        return hits.iterator();
    } else {
        return Collections.emptyIterator();
    }
}

Disclaimer: I haven't thoroughly tested the code!

Demo / DataSets

Thanks for your work, it looks interesting!

Could you publish your datasets and the source code of the demo on GitHub?

Greetings
Sebastian

Implementation of bag (ArrayIndexOutOfBoundsException)

In the implementation of Bag.java, an ArrayIndexOutOfBoundsException will be raised if you do the following:

  1. Create a new bag from an array of any type that has duplicate entries, some of which are at the end of the array, for example:
    ["test","is","a","feature","test"]
  2. Call the feature method on a sequence of words that contains "test"

The reason this happens is that the this.features hashmap first sets the index for "test" to 0, but later to 4, which is out of bounds, because the features hashmap contains only 4 elements.

Possible fixes (a sketch of fix (a) follows the list) are:

a. Before putting an entry with an index into the hashmap, check that it is not already contained there.

b. Insert the entries backwards into the hashmap, so that the lowest index wins.

c. Update the documentation to specify that only arrays with unique entries can be used to train bags.

d. An alternative?
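
A hedged sketch of fix (a) as a standalone illustration; the real field name and surrounding code in Bag.java are assumptions based on the description above:

import java.util.HashMap;
import java.util.Map;

// Assign each distinct feature a dense index, skipping duplicates so a later
// occurrence can never overwrite an earlier, smaller index.
static Map<String, Integer> buildFeatureIndex(String[] features) {
    Map<String, Integer> index = new HashMap<>();
    for (String f : features) {
        if (!index.containsKey(f)) {
            index.put(f, index.size());
        }
    }
    return index;
}

// buildFeatureIndex(new String[]{"test", "is", "a", "feature", "test"})
// yields {test=0, is=1, a=2, feature=3}, so every index stays in bounds.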

I'd be happy to implement any of these fixes, so please let me know which one you prefer, and I'll open a PR with it.

Kind regards,

Mike

Manifold Learning - Laplacian Eigenmap - Inverted Values

Hi,

I'm new to this topic of non-linear dimension reduction. Essentially, what I'm trying to do is take advantage of the locality preservation of the Laplacian Eigenmap algorithm to find common features in an image, or to get a structural representation of the image.

It works as expected, except that the output image sometimes has inverted values.

I've seen the manifold demo that you provide, and in the case of the example with the Swiss roll data, it seems that the sign of the coordinates sometimes changes as well.

Attached you can find a detailed description:

https://docs.google.com/document/d/1YTY9rGkUJR5_Qfh9YM3pQmhVoSMj3OJIx8KOampk_aI/edit?usp=sharing

Thanks, and looking forward to your comments.

Handling of identical objects in DBScan

Hi,

First, let me thank you for the great library; it is a pleasure to use.

In my work project I use the DBScan class to cluster a bag of strings.
I have noticed that, by default, identical objects are ignored and not clustered together.

Please consider the following test:

@Test
public void testIdentical() {
    String[] dataset = new String[]{"a", "a", "a", "a", "b"};
    Distance<String> distance = new Distance<String>() {
        @Override
        public double d(String x, String y) {
            return x.equals(y) ? 0.0 : 1.0;
        }
    };

    DBScan<String> clusterer = new DBScan<>(dataset, distance, 2, 0.5);
    int[] labels = clusterer.getClusterLabel();
    assertTrue(labels[0] != DBScan.OUTLIER);
}

It fails because the strings "a" are not clustered together and are instead considered outliers.
I believe this behavior is not intuitive.
The problem is not critical, and I managed to work around it with the following lines:

LinearSearch<String> search = new LinearSearch<>(allNames, JARO_WINKLER_DISTANCE);
search.setIdenticalExcluded(false);
DBScan<String> clusterer = new DBScan<>(allNames, search, MIN_PTS, RADIUS);

But nevertheless, do you think the default behavior is OK?
Should we at least provide a DBScan constructor that allows creating the model with identical objects not excluded?

Thanks for your time.

How to encode a new entry in a double array?

For example, using the Car Evaluation Data Set of categorical features from the UCI ML Repository:

val atts = new Array[Attribute](6)
atts(0) = new NominalAttribute("V1")
atts(1) = new NominalAttribute("V2")
atts(2) = new NominalAttribute("V3")
atts(3) = new NominalAttribute("V4")
atts(4) = new NominalAttribute("V5")
atts(5) = new NominalAttribute("V6")

val parser = new DelimitedTextParser()
parser.setDelimiter(",")
parser.setResponseIndex(new NominalAttribute("class"), 6)

val dataset = parser.parse("Test Dataset", atts, "src/main/resources/car.csv")

I'll have the data adequately represented in an AttributeDataset, with double[] values internally, so I can train a classifier with the data.

val data = dataset.toArray(new Array[Array[Double]](dataset.size()))
val labels = dataset.toArray(new Array[Int](dataset.size()))

val classifier = new DecisionTree(data, labels, 30)

Now, my question is: how do I encode new data entries into a double[] in the same format as the AttributeDataset returned by the DelimitedTextParser.parse(String, Attribute[], String) method, so I can use them with the Classifier.predict(double[]) method? (A sketch of one possibility follows the snippet below.)

val newData: Array[Double] = // vhigh,vhigh,2,more,small,med,unacc
classifier.predict(newData)
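
A hedged sketch of one possibility, in Java for consistency with the library: it assumes NominalAttribute.valueOf(String) maps a category string to the same double code the parser assigned during parsing, which I have not verified against the javadocs. atts and classifier are the ones defined above:

// Encode one raw record through the same attributes used at parse time.
String[] raw = {"vhigh", "vhigh", "2", "more", "small", "med"};
double[] newData = new double[raw.length];
for (int i = 0; i < raw.length; i++) {
    newData[i] = atts[i].valueOf(raw[i]);  // assumed API; may throw ParseException
}
int label = classifier.predict(newData);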

Smile AttributeDataset export to .arff

Hi,

Is there an easy way to export an AttributeDataset to the .arff file format?

This could be a facility to develop in Smile and test against Weka and other ML engines.

Regards!

How to deal with sparse features

I'm trying out this library and the first results are great. But I have a problem: I need to use text to create a machine learning model.

The issue is that my double[][] feature matrix is going to be very large and sparse, because each position is 1 if the corresponding token (word) is present in the text. The vocabulary can grow a lot.

How can I deal with this? I'm thinking about feature vectorization using the hashing trick. Is there another way?
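
A hedged sketch of the hashing trick mentioned above: tokens are hashed into a fixed number of buckets, so the feature width stays constant no matter how large the vocabulary grows. The bucket count and whitespace tokenization are placeholders:

// Map a text to a fixed-width binary presence vector via hashing.
String text = "the cake is a lie";
int dim = 1 << 18;  // fixed number of buckets; tune against collision rate
double[] features = new double[dim];
for (String token : text.split("\\s+")) {
    int bucket = Math.floorMod(token.hashCode(), dim);
    features[bucket] = 1.0;  // binary presence, as described above
}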
